As computers are integrated into systems that have stringent fault-tolerance requirements,
there is a growing need for techniques to establish that these systems actually satisfy those
requirements. Informal arguments do not supply the desired level of assurance for critical
systems. This dissertation presents a rigorous, automated approach to analyzing distributed
systems, with a focus on checking fault-tolerance requirements, and describes a
prototype  implementation  of  the  analysis.  The  analysis  is  a  novel  hybrid  of  ideas  from 
stream-processing  semantics  of  networks  of  processes,  abstract  interpretation  of  programs, 
and  symbolic  computation.  The  underlying  principles  of  the  analysis  method  are  general, 
but  specialized  techniques — such  as  the  use  of  perturbations  to  represent  changes  to  normal 
behavior  caused  by  failures — are  developed  to  deal  efficiently  with  the  types  of  systems  and 
requirements  that  arise  in  establishing  fault-tolerance.  The  method  is  illustrated  with  three 
examples: the Oral Messages algorithm for Byzantine Agreement, due to Lamport, Shostak,
and Pease; a standard protocol for FIFO reliable broadcast; and a (subtly) flawed protocol
for fault-tolerant moving agents.


Biographical  Sketch 


Scott David Stoller New Jersey, U.S.A. His childhood was, in retrospect, uneventful.
He was graduated summa cum laude from Princeton University
with  a  Bachelor’s  degree  in  Physics  in  1990.  After  spending  one  year  as  a  graduate  student 
in  the  Physics  Department  at  Cornell,  he  transferred  to  the  Computer  Science  Department. 
In  the  spring  of  1992,  he  took  Anil  Nerode’s  logic  course.  There  he  met  Yanhong  Annie  Liu, 
who  has  made  his  life  richer  in  countless  ways.  In  May  1996,  they  were  married.  In  August 
1996,  they  joined  the  faculty  of  Indiana  University  in  Bloomington. 


To  all  my  teachers, 
especially  my  parents 


Acknowledgements 


I  would  like  to  thank  my  thesis  advisor,  Fred  Schneider,  for  his  guidance,  support,  knowledge, 
and  insights,  which  have  been  invaluable  to  the  research  in  this  thesis  and  beyond.  The 
opportunity  to  work  with  him  has  been  a  pleasure  and  a  privilege. 

I  would  also  like  to  thank  Bob  Constable  for  sharing  his  enthusiasm  for  and  insights  into 
logic,  programming  languages,  my  research  (everywhere  it  turned),  and  many  other  topics. 

I  thank  Annie  Liu  for  her  love,  encouragement,  criticisms,  and  suggestions,  which  have 
contributed  greatly  to  this  work. 

I would also like to thank Rich Zippel, Robbert van Renesse, Doug Howe, David Gries,
Tom  Henzinger,  Stuart  Allen,  Wilfred  Chen,  Lorenzo  Alvisi,  Thomas  Yan,  Pei-Hsin  Ho, 
Yaron  Minsky,  and  Mark  Hayden  for  many  enjoyable  and  thought-provoking  discussions. 

This material is based on work supported in part by the NASA/ARPA grant NAG-2-893,
AFOSR  grant  F49620-97-1-0013,  and  ARPA/RADC  grant  F30602-96-1-0317.  Any  opinions, 
findings,  and  conclusions  or  recommendations  expressed  in  this  publication  are  those  of  the 
author  and  do  not  reflect  the  views  of  these  agencies. 

Finally,  I  thank  my  parents,  for  their  love  and  for  preparing  me  for  life. 




Table  of  Contents 


1 Introduction
1.1 Establishing Fault-Tolerance
1.2 Overview of this Work
1.3 Outline of the Dissertation

2 Analyzing Systems that Never Fail
2.1 Concrete Model
2.1.1 Kahn’s Model of Determinate Systems
2.1.2 Concrete Model of Non-Determinate Systems
2.1.3 A Running Example
2.2 Representation of Runs
2.2.1 ms-atoms
2.2.2 Values
2.2.3 Multiplicity
2.2.4 Tags
2.2.5 Message Ordering
2.3 Representation of Components
2.3.1 Notation for Functions
2.3.2 Running Example
2.4 Semantics and Soundness
2.4.1 Semantics
2.4.2 Soundness
2.4.3 Invariants
2.4.4 Sanity Conditions for Input-Output Functions
2.5 Termination of Fixed-Point Calculations
2.5.1 Monotonicity of step
2.5.2 The First Step
2.5.3 Finite Ascending Chains
2.5.4 Run is not an ω-cpo
2.6 Sanity Conditions for ms-atoms

3 Analyzing Systems that Fail
3.1 Fault-Tolerance Analysis Without Perturbations
3.1.1 Behavior of Failure-Prone Systems
3.1.2 Fault-Tolerance Requirements
3.1.3 Running Example
3.2 Motivation for Changes
3.2.1 Expressiveness
3.2.2 Non-Trivial Relationships between Original and Perturbed Values
3.3 Concrete Model with Failures and Correlations
3.3.1 Running Example
3.4 Perturbational Framework: Representation of Runs
3.4.1 Running Example
3.5 Perturbational Framework: Representation of Components
3.5.1 Running Example
3.5.2 Semantics
3.5.3 Soundness
3.5.4 Termination of Fixed-Point Calculations

4 Two Classic Problems in Fault-Tolerance
4.1 Reliable Broadcast
4.1.1 Reliable Broadcast Protocol
4.1.2 Relationships Between Multiplicities
4.1.3 Fault-Tolerance Requirement
4.1.4 Input-Output Functions
4.1.5 Examples
4.2 Byzantine Agreement
4.2.1 Oral Messages Algorithm
4.2.2 Analysis of Perturbed Behavior

5 Fault-Tolerance for Moving Agents
5.1 Fault-tolerance for Moving Agents
5.1.1 Voting After Each Stage
5.1.2 The Effects of Byzantine Failures
5.2 Input-Output Functions
5.2.1 Input-Output Function for Servers
5.2.2 Other Input-Output Functions
5.3 Analysis of Perturbed Behavior
5.3.1 Failure of Visited Services
5.3.2 Failure of Unvisited Services
5.3.3 Failure of Visited and Unvisited Services
5.4 Discussion
5.4.1 Symbolic vs. Abstract Values
5.4.2 Abstracting from Paths
5.4.3 Approximation of Message Extensions

6 Related and Future Work
6.1 Related Work
6.2 Future Work

A Index of Symbols

B CRAFT: A Tool for Fault-Tolerance Analysis
B.1 Overview
B.2 Type Definitions and Function Declarations
B.3 Using CRAFT

Bibliography




List  of  Figures 


1.1 Example of graphical representation of system behavior
2.1 Example of graphical representation of system behavior
2.2 Run for running example
2.3 Run for two-stage replicated pipeline when F1 suffers a Byzantine failure
3.1 Impossibility example 1: failure-free behavior
3.2 Impossibility example 1: faulty run
3.3 Impossibility example 2: failure-free behavior
3.4 Impossibility example 2: faulty run
3.5 Failure-free behavior of system with ECC
3.6 Idealized behavior of system with median
3.7 Run for running example when component F1 fails
3.8 Definition of ballotp
3.9 Definition of Voter Fc
4.1 Failure-free behavior of the reliable broadcast protocol
4.2 Initial behavior of the reliable broadcast protocol when S1 crashes
4.3 Behavior of the reliable broadcast protocol when S1 crashes
4.4 Behavior of the reliable broadcast protocol when S1 crashes
4.5 Definition of server
4.6 Definition of mulRB
4.7 Behavior of the reliable broadcast protocol when S1 and S2 crash
4.8 Failure-free behavior of the Oral Messages algorithm
4.9 Definition of relay, with two auxiliary functions
4.10 Behavior of the Oral Messages Algorithm when L2 is faulty
4.11 The run stepF(nfBAJsL)(±Run)
4.12 Behavior of the Oral Messages Algorithm when C is faulty
5.1 Run of replicated two-stage moving agent
5.2 Run of replicated two-stage moving agent, with authentication
5.3 Run of replicated two-stage moving agent, with authentication and with voting after each stage
5.4 Definition of server i
B.1 Window for entering information about a new component
B.2 Result of analysis in absence of failures
B.3 Result of analysis when component F1 suffers a value failure
B.4 CAML type definitions corresponding to Val and Mul
B.5 CAML type definitions corresponding to AVal, AMul, L, and LFC
B.6 CAML type definitions corresponding to Run, IOF, RunFc, and IOFfc, plus miscellaneous other CAML type definitions and function declarations




Chapter  1 


Introduction 


As  computers  are  integrated  into  systems  that  have  stringent  fault-tolerance  requirements, 
there is a growing need for techniques to establish that these systems actually satisfy those
requirements. These systems include safety-critical systems, from digital flight control systems
to factory automation systems, and business-critical systems, from traditional applications,
like distributed databases, to nascent ones, like systems for electronic commerce. Moreover,
as networks of workstations (NoWs) become an increasingly popular platform for large-scale
computations of all kinds, fault-tolerance is becoming an issue for general-purpose computing.

Establishing fault-tolerance of a system involves three major steps: (1) identifying possible
failures of each component, (2) determining requirements on system behavior for each
combination of component failures, and (3) checking whether system behavior actually satisfies
these requirements. Step 3 is sometimes tackled with informal arguments. However,
informal  arguments  do  not  supply  the  desired  level  of  assurance  for  critical  systems.  This 
thesis  presents  a  rigorous,  automated  approach  to  analyzing  distributed  systems,  with  a  focus 
on  checking  fault-tolerance  requirements.  The  underlying  principles  of  the  analysis  method 
are  general,  but  specialized  techniques  are  developed  to  deal  efficiently  with  the  types  of 
systems  and  requirements  that  arise  in  establishing  fault-tolerance. 

1.1 Establishing Fault-Tolerance

The  first  step  in  establishing  fault-tolerance  of  a  system  is  to  identify  possible  failures  of  each 
component.  Each  failure  corresponds  to  one  way  in  which  a  component’s  actual  behavior 
might diverge from its normal (specified) behavior. For example, one commonly-considered
failure of processors is the crash failure, which causes the processor to halt. A more severe
failure is the Byzantine failure, which causes the processor to execute an arbitrary sequence
of instructions, unrelated to the program it would have executed in the absence of the failure.
Combinations of failures that might occur in a system are called failure scenarios; thus, a
failure scenario for a system is simply an assignment of a failure to each component of that
system. Since components sometimes (hopefully most of the time!) do not fail, we introduce
a special failure called OK, corresponding to absence of failure; in other words, OK indicates
that the divergence from normal behavior is nothing (or “zero”).

The  second  step  is  to  determine  requirements  on  system  behavior  in  the  failure  scenarios. 
Fault-tolerance  requirements  can  be  expressed  as  a  function  b  such  that,  for  each  failure 
scenario  fs  of  the  system,  b(fs)  is  a  condition  that  the  system’s  behavior  should  satisfy  in 
that  failure  scenario.  For  example,  an  aircraft  control  system  might  be  required  to  provide 
normal  service  despite  the  Byzantine  failure  of  any  one  component.  More  precisely,  for  every 
failure  scenario  in  which  at  most  one  component  is  faulty,  the  signals  sent  by  the  control 
system  to  the  actuators  should  be  the  same  as  if  no  failures  had  occurred. 
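The first two steps can be illustrated with a minimal Python sketch. It is not taken from the dissertation: the component names, the failure set, and the dictionary encoding of behaviors are assumptions chosen for illustration, and the condition is allowed to consult the failure-free behavior so that "the same as if no failures had occurred" can be written down directly.

```python
from itertools import product

OK = "OK"                               # special failure: no divergence from normal behavior
FAILURES = (OK, "crash", "byzantine")   # assumed failure set for every component

def failure_scenarios(components, failures=FAILURES):
    """Yield every assignment of one failure to each component."""
    for combo in product(failures, repeat=len(components)):
        yield dict(zip(components, combo))

def faulty(fs):
    """Components whose assigned failure is something other than OK."""
    return {c for c, f in fs.items() if f != OK}

def b(fs):
    """Hypothetical requirement function: map a failure scenario to a condition.

    If at most one component is faulty, the signals reaching the actuator must
    equal the failure-free ones; otherwise nothing is required.
    """
    if len(faulty(fs)) <= 1:
        return lambda behavior, failure_free: behavior["actuator"] == failure_free["actuator"]
    return lambda behavior, failure_free: True

components = ["F1", "F2", "F3"]         # illustrative component names
scenarios = list(failure_scenarios(components))
```

With three components and three failures there are 27 scenarios, of which 7 have at most one faulty component; only those 7 carry a non-trivial condition under this particular choice of b.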

The  third  step  is  to  check  whether  a  system  will  satisfy  its  fault-tolerance  requirements. 
Experience  has  shown  that  informal  arguments  about  fault-tolerance  are  error-prone  and  do 
not  supply  the  desired  level  of  assurance  for  critical  systems  [ORSvH95].  For  example,  one 
might guess that replication and voting is an easy (albeit expensive) way to achieve
fault-tolerance. However, the extensive literature on Byzantine agreement, and errors like the one
described  in  [LR93],  show  that  efficiently  coordinating  non-faulty  replicas  in  the  presence  of 
arbitrary  behavior  by  faulty  replicas  is  a  difficult  problem. 

The difficulty of analyzing fault-tolerance by informal methods has inspired the development
and application of rigorous methods. One approach is to apply general-purpose
proof-based verification techniques. Work on SIFT [W+78], an aircraft control computer, is
a classic example; indeed, this is one of the earliest applications of any rigorous approach
to fault-tolerance. Mechanized support, in the form of a theorem-proving system, typically
helps to manage the large and complex proofs. The use of these general-purpose verification
tools offers an attractive conceptual economy. However, most people who design and
validate fault-tolerant systems are not experts in mathematical logic or formal verification,
so methods that require them to construct large proofs (even with support from a
theorem-proving system) are problematic. Proof techniques designed specifically for verification of
fault-tolerance have been proposed [CdR93,Web93,PJ94,Sch94]. These techniques do facilitate
proofs of fault-tolerance, but still require considerable logical expertise of the user.

Automated verification techniques have received increasing attention in recent years,
largely as a result of advances in temporal-logic model-checking [CGL94] and automata-
and process-based verification techniques [Hol91,Kur94,CS96]. The techniques are largely
based  on  exhaustive  exploration  of  finite  state  spaces.  They  are  particularly  well-suited  to 
hardware  verification  and  have  been  applied  predominantly  thereto.  Relatively  little  work 
has  been  done  on  automated  analysis  of  fault-tolerant  systems,  partly  because  the  protocols 
of  interest  are  more  typical  of  software  than  hardware,  and  exhaustive  search  of  the  state 
space  of  interesting  software  systems  is  often  infeasible. 

1.2  Overview  of  this  Work 

This thesis explores a specialized approach to analysis of distributed systems, focusing on
fault-tolerance properties. Our approach is not based on exhaustive state-space exploration.
Instead, it is a novel hybrid of ideas from stream-processing (or data-flow) semantics of networks
of processes [Kah74,Bro87,Bro90], abstract interpretation of programs [AH87], and
symbolic computation. An important feature of our approach is that flexible and powerful
abstraction mechanisms are incorporated directly into the framework.1 Having these
mechanisms plays a crucial role in making fault-tolerance analysis tractable.

Our use of abstraction aims to exploit separation of concerns: analysis of failures is
separated as much as possible from other aspects of system analysis. To facilitate this
separation of concerns, the analysis is parameterized by possible occurrences of failures, and
system behavior is analyzed separately for different failure scenarios. A different and more
common (e.g., [LJ92,CdR93,Web93,PJ94,LM94]) approach is to model failures as events
that occur non-deterministically during a computation; however, this makes it difficult to
separate the effects of failures from other aspects of system behavior and hence to model the
former more finely than the latter.

Our analysis methods rest on a separation of concerns in specifications: fault-tolerance
requirements are separated as much as possible from other correctness requirements. This
separation allows the analysis to ignore aspects of the system that do not directly impact
its fault-tolerance. For example, in a system with replicated processors, detailed analysis of
how the results from different replicas are combined (e.g., voted) may be needed, but other
aspects of the processing (e.g., the particular state machine implemented by each replica)
are treated in our approach by coarse approximations. Such abstraction is crucial for making
the analysis tractable.

1 Following literature on abstract interpretation (e.g., [AH87]) and program refinement (e.g., [KMP94]),
we use “abstraction” in the sense of “approximation”. This has little to do with the meaning of “abstraction”
in the theory of functional programming languages (e.g., [Rea89]) or abstract data types.

Our  analysis  uses  a  fixed-point  calculation  to  determine  three  kinds  of  information  that, 
together,  characterize  system  behavior: 

Values:  The  data  sent  in  messages. 

Multiplicities:  The  number  of  times  each  value  is  sent. 

Orderings:  The  order  in  which  values  are  sent. 

Values and their multiplicities are approximated using abstract values, each representing a set
of possible values, as in abstract interpretation [AH87]. We also use symbolic values, which
are expressions composed of constants and variables, to capture additional relationships
between values. Orderings are approximated by allowing partial orders, rather than just total
orders. This support for approximation of all three kinds of information allows irrelevant
aspects of a system to be suppressed and allows compact representation of the highly
non-deterministic behavior that failures can cause.
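The fixed-point calculation mentioned above can be pictured with a small Python sketch. It is only an illustration under simplifying assumptions, not the analysis defined in Chapter 2: a run is modeled as a dictionary from edges to sets of symbolic messages, multiplicities and orderings are ignored, and the three-component system (S, P, A) and the function names are hypothetical.

```python
def least_fixed_point(step, bottom, max_iters=1000):
    """Apply step repeatedly, starting from bottom, until the run stops changing."""
    current = bottom
    for _ in range(max_iters):
        nxt = step(current)
        if nxt == current:
            return current
        current = nxt
    raise RuntimeError("no fixed point reached within max_iters")

def step(run):
    """One round: every component recomputes its outputs from the current run.

    S always sends the symbolic value X to P; P applies an unknown function F
    to whatever it has received so far and forwards the result to A.
    """
    return {
        ("S", "P"): run[("S", "P")] | {"X"},
        ("P", "A"): run[("P", "A")] | {f"F({m})" for m in run[("S", "P")]},
    }

# The empty run: nothing has been sent on any edge yet.
BOTTOM = {("S", "P"): frozenset(), ("P", "A"): frozenset()}
result = least_fixed_point(step, BOTTOM)
```

Because step only ever adds messages and the set of possible messages here is finite, the iteration stabilizes after a few rounds; the general termination argument is the subject of Section 2.5.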

For example, suppose one process of a system sends to another process a message containing
a number, then possibly sends a second message containing the same number. The
data in the first message could be represented using abstract value N, representing the natural
numbers, and symbolic value X, where X is a variable that denotes the actual value
sent. The multiplicity of the first message is 1, which is (overloaded as) an abstract value
that represents only the number one; since this abstract multiplicity determines the multiplicity
uniquely, it doesn’t matter what symbolic multiplicity is used. The second message
would be represented using the same abstract value N and, to show that the values
in the two messages are equal, the same symbolic value X. The multiplicity of the second
message is ?, which is an abstract value that we define to represent zero or one; the symbolic
multiplicity might be a different variable.
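The two-message example can be written out as a small Python sketch. The record layout, the field names, and the variable names M1 and M2 are assumptions made for illustration; the dissertation's actual representation of ms-atoms is given in Section 2.2. Symbolic multiplicities are carried along but not checked here.

```python
from dataclasses import dataclass

# Abstract values denote sets of concrete values; predicates stand in for the sets.
ABSTRACT_VALUES = {
    "N": lambda v: isinstance(v, int) and v >= 0,   # the natural numbers
}

# Abstract multiplicities denote sets of possible message counts.
ABSTRACT_MULS = {
    "1": {1},      # sent exactly once
    "?": {0, 1},   # sent zero times or once
}

@dataclass(frozen=True)
class MsAtom:
    abs_val: str   # abstract value, e.g. "N"
    sym_val: str   # symbolic value, e.g. the variable "X"
    abs_mul: str   # abstract multiplicity, e.g. "1" or "?"
    sym_mul: str   # symbolic multiplicity (unused in this sketch)

first = MsAtom("N", "X", "1", "M1")
second = MsAtom("N", "X", "?", "M2")

def consistent(atoms, observed):
    """Check concrete messages against atoms under a single variable binding.

    observed[i] is the list of concrete values actually sent for atoms[i].
    """
    binding = {}
    for atom, msgs in zip(atoms, observed):
        if len(msgs) not in ABSTRACT_MULS[atom.abs_mul]:
            return False
        for v in msgs:
            if not ABSTRACT_VALUES[atom.abs_val](v):
                return False
            # The same symbolic value must denote the same concrete value everywhere.
            if binding.setdefault(atom.sym_val, v) != v:
                return False
    return True
```

Sharing the symbolic value X between the two atoms is what rules out a run in which the second message carries a different number, something the abstract value N alone cannot express.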

We require that all approximations used in modeling a component be conservative, i.e.,
they  must  over-estimate,  rather  than  under-estimate,  the  component’s  possible  behaviors. 
The  use  of  conservative  approximations  ensures  that  the  information  (values,  multiplicities, 
and  orderings)  determined  by  the  analysis  includes  all  possible  behaviors  of  the  system 
being  analyzed.  But  because  approximations  are  being  used,  the  values,  multiplicities,  and 
orderings may represent other behaviors as well. Thus, analyzing the approximation (rather
than the system it approximates) never gives false positives but may give false negatives. In
other words, the analysis may fail to show that all of a system’s possible behaviors satisfy a
fault-tolerance  requirement,  even  if  they  do.  The  possibility  of  false  negatives  is  an  inevitable 
consequence  of  the  approximations  that  enable  efficient  and  automated  analysis  of  systems 
with  intractably  large  state  spaces. 

To support efficient and convenient fault-tolerance analysis, we extend the analysis framework
sketched above (hereafter called the non-perturbational framework) to obtain the
perturbational framework, in which perturbations (changes) due to failures are made explicit.
Here, the effects of a failure are represented as changes to the original outputs of the faulty
component. Changes to the outputs of a component change the inputs to components that
use those outputs, and changes to a component’s inputs generally cause changes to its outputs.
Each component is characterized, in part, by how it propagates changes. For example,
one might normally describe the behavior of a majority-voter as follows: if a majority of
its inputs are equal, then its output is the majority value among its inputs. Intuitively,
the justification for using a majority-voter to mask the effects of failures corresponds more
directly to the fact that a majority-voter propagates changes to its inputs consistently with
the rule: if the inputs are originally equal and at most a minority of them change, then the
output is unchanged.
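The change-propagation rule for a majority-voter can be checked by brute force on small cases. The following Python sketch is illustrative only; the function names are hypothetical, and the changed inputs are all given the same wrong value, which is the most adversarial choice for a voter.

```python
from collections import Counter
from itertools import combinations

def majority(inputs):
    """Return the value held by a strict majority of the inputs, else None."""
    value, count = Counter(inputs).most_common(1)[0]
    return value if count > len(inputs) / 2 else None

def output_unchanged(original, perturbed):
    """True if the voter's output is the same before and after the perturbation."""
    return majority(original) == majority(perturbed)

def check_rule(n, value="v", bad="bad"):
    """Check the rule for n inputs that are originally all equal.

    Every way of changing a minority of the inputs (fewer than n/2 of them)
    is tried; the voter's output must be unchanged each time.
    """
    original = [value] * n
    for k in range((n - 1) // 2 + 1):
        for changed in combinations(range(n), k):
            perturbed = [bad if i in changed else value for i in range(n)]
            if not output_unchanged(original, perturbed):
                return False
    return True
```

The rule says nothing about the case in which a majority of inputs change; there the output may differ, as `output_unchanged(["v", "v", "v"], ["x", "x", "v"])` shows.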

For example, consider the graph in Figure 1.1, which we use to represent the behavior
of a two-stage replicated pipeline having one faulty component F1. The meaning of such
graphs is defined formally in Chapter 3; here we give only an informal explanation. The
nodes of the graph correspond to components. Edge (x, y) is labeled with a representation
of messages sent from x to y. The graph represents both the failure-free behavior of the
system and the behavior in a specified failure scenario: here, the failure scenario in which
F1 suffers a Byzantine failure, and the other components do not fail. In figures, dots on the
circumference of a node indicate a failure other than OK for that component; the specific
failure is identified in the text (as in the previous sentence).

Edges are labeled with ms-atoms, which represent sets of messages. There are two kinds
of ms-atoms: perturbed ms-atoms, which contain square brackets, and new ms-atoms, which
do not. Each perturbed ms-atom contains two parts: an original part (preceding the square
brackets) and a perturbation (enclosed in square brackets). In this figure, the edge from
F1 to S has a new ms-atom; the other edges all have perturbed ms-atoms. The failure-free
behavior of the system is represented by the original parts of the perturbed ms-atoms. Thus,
we see that the source S sends a natural number, represented by the symbolic value X, to
processors F1-F3. Each of these processors applies a function, represented by the symbol
F, to its input and sends the result to the corresponding processor in the next stage of the
pipeline. Processors G1-G3 apply a function represented by G to their input and send the
result, namely G(F(X)), to a 3-input voter V, which selects the majority of its inputs and
sends the result to actuator A.


Figure  1.1:  Example  of  graphical  representation  of  system  behavior. 


The changes to this behavior caused by failure of F1 are represented by the perturbations
in the perturbed ms-atoms as well as by the new ms-atoms, which represent messages having
no analogue in the failure-free behavior of the system. Thus, the graph represents the case
where F1 fails by sending perturbed messages to G1 and new messages to S. The symbol
Tav in a ms-atom denotes an arbitrary change to the data sent in the message, and the
superscript *a denotes an arbitrary change to the multiplicity. Thus, the graph of Figure 1.1
reflects that when F1 fails, it might send an arbitrary number of arbitrary messages to G1.
The abstract value Ty represents all concrete values; thus, when F1 fails, it might also send
an arbitrary number of arbitrary messages to S. Since G1’s input is perturbed arbitrarily,
its output is function G applied to an arbitrary value. Without specific information about
function G, the result of the application is itself just an arbitrary value, so the output of G1
is also perturbed arbitrarily. As before, symbol Tav represents an arbitrary perturbation.
The output of the voter is not perturbed; empty square brackets in a perturbed ms-atom
are used to denote “no change”. So, if the fault-tolerance requirement for this system is that
the input to the actuator is unchanged in this failure scenario, then this graph states that
the system satisfies this requirement.
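The propagation of perturbations through the pipeline of Figure 1.1 can be mimicked with a deliberately coarse Python sketch. Everything here is an assumption made for illustration: perturbations are reduced to two tokens, the source S is assumed to ignore new messages from a faulty component, and a single pass suffices because the simplified graph is acyclic.

```python
UNCHANGED = "[]"       # empty brackets: no change
ARBITRARY = "[arb]"    # stand-in for an arbitrary perturbation of data and multiplicity

def stage(input_pert, failed=False):
    """Opaque processing stage (F or G).

    A Byzantine-faulty stage, or one whose input is perturbed, produces an
    arbitrarily perturbed output, since nothing is known about its function.
    """
    return ARBITRARY if failed or input_pert != UNCHANGED else UNCHANGED

def voter(input_perts):
    """Majority-voter: output is unchanged if only a minority of inputs changed."""
    changed = sum(p != UNCHANGED for p in input_perts)
    return UNCHANGED if changed < len(input_perts) / 2 else ARBITRARY

def analyze(failed):
    """Compute the perturbation on each stage output for a set of faulty components."""
    source_out = UNCHANGED   # assumption: S's outputs are unaffected by new messages
    f_out = {i: stage(source_out, failed=f"F{i}" in failed) for i in (1, 2, 3)}
    g_out = {i: stage(f_out[i], failed=f"G{i}" in failed) for i in (1, 2, 3)}
    return {"F": f_out, "G": g_out, "A": voter(list(g_out.values()))}
```

For the scenario of the figure, `analyze({"F1"})` reports arbitrary perturbations on the outputs of F1 and G1 and no change at the actuator; adding a second fault in a different replica, as in `analyze({"F1", "G2"})`, perturbs the actuator input.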

Including perturbations in ms-atoms allows the sensitivity of a component to perturbations
of its inputs to be expressed directly. Without this notation, this information sometimes
would have to be encoded awkwardly in the values and multiplicities, and sometimes would
not be expressible at all. For example, consider again the system shown in Figure 1.1. In a
framework without explicit perturbations, there is no way to express whether or not the new
messages sent from F1 to S confuse the source and cause changes to the data that S sends
to F2 and F3. The problem is that even if the data sent by S changes, it is still represented
by the symbolic value X. In the graph in Figure 1.1, the square brackets on the source’s
outputs are empty; this states that outputs from S are indeed unchanged by the additional
messages sent from F1 to S.

The use of explicit perturbations also allows fault-tolerance requirements to be conveniently
expressed as constraints on the acceptable perturbations to system behavior. Recall
that  in  the  non-perturbational  framework,  the  condition  associated  with  a  failure  scenario 
is  simply  a  predicate  that  the  system  behavior  in  that  failure  scenario  must  satisfy.  In  the 
perturbational  framework,  the  condition  associated  with  a  failure  scenario  is  allowed  to  be  a 
relation  that  must  hold  between  the  system’s  failure-free  behavior  and  its  behavior  in  that 
failure  scenario.  For  example,  a  typical  fault-tolerance  requirement  is  that  inputs  to  certain 
components  are  unchanged  in  certain  failure  scenarios;  a  weaker  requirement  is  that  inputs 
to  those  components  be  either  unchanged  or  absent. 
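The two requirements just mentioned can be phrased as relations between failure-free and faulty behavior. The Python sketch below is illustrative only; it assumes, hypothetically, that a behavior is a dictionary mapping each component to the list of messages it receives.

```python
def unchanged(original, perturbed):
    """Inputs in the faulty run are exactly the failure-free inputs."""
    return perturbed == original

def unchanged_or_absent(original, perturbed):
    """Weaker relation: inputs are either the failure-free ones or missing entirely."""
    return perturbed == original or len(perturbed) == 0

def satisfies(relation, failure_free, faulty_run, components):
    """Check that the relation holds for the inputs of every listed component."""
    return all(relation(failure_free[c], faulty_run[c]) for c in components)

# Illustrative behaviors for a single actuator A.
failure_free = {"A": ["G(F(X))"]}
masked = {"A": ["G(F(X))"]}      # the failure is masked completely
dropped = {"A": []}              # the failure suppresses the actuator's input
corrupted = {"A": ["garbage"]}   # the failure changes the actuator's input
```

A plain predicate on the faulty run alone could not distinguish `masked` from `corrupted` without already knowing what the failure-free input was, which is why the condition is allowed to relate the two behaviors.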

Explicit perturbations may be unfamiliar in fault-tolerance analysis, but they are analogous
to familiar techniques for analysis of numerical error [Sca62]. Error analysis focuses on
how numerical errors introduced in one part of the computation are propagated by subsequent
computation. Analogously, fault-tolerance analysis with explicit perturbations focuses
on  how  perturbations  introduced  by  failures  are  propagated  during  subsequent  execution  of 
the  system.  Note  that  the  separation  of  error  analysis  from  other  aspects  of  correctness  of 
a  numerical  computation  is  analogous  to  separation  of  fault-tolerance  analysis  from  other 
correctness  concerns. 

Feasibility  of  the  Approach.  The  computational  complexity  of  our  analysis  method 
depends  on  the  number  of  failure  scenarios  for  which  the  analysis  must  be  performed  and 
on the cost of the analysis for each failure scenario. The number of failure scenarios sometimes
can be reduced by use of symmetry arguments (if one failure scenario is a symmetric
variant of another) and by abstracting from the timing of failures (making each failure
scenario specify less precisely when the failure occurs). The cost of the analysis for one
failure scenario depends largely on the complexity of the fault-tolerance mechanisms, since
their behavior must be analyzed, and on the extent to which separation of concerns can
be achieved, i.e., the extent to which other mechanisms in the system can be ignored in
the analysis. Complex fault-tolerance mechanisms (for example, protocols involving many
rounds of communication) will take longer to analyze than simpler ones, but this seems
inevitable. Separation of concerns is a design principle underlying many fault-tolerant systems,
so achieving a similar separation of concerns in the analysis is a natural goal. The
key is to find approximations that are coarse enough for tractability and precise enough to
validate interesting systems (i.e., precise enough not to yield false negatives).
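The effect of a symmetry argument on the number of failure scenarios can be illustrated with a short Python sketch. The setting is an assumption chosen for illustration: a two-stage pipeline with three interchangeable replicas (Fi feeding Gi), two possible failures per component, and scenarios regarded as equivalent when they differ only by a renaming of replicas.

```python
from itertools import product

def scenarios(components, failures):
    """All assignments of one failure to each component."""
    return [dict(zip(components, combo))
            for combo in product(failures, repeat=len(components))]

def canonical(fs, replicas):
    """Canonical form of a scenario under permutation of whole replicas.

    Each replica is summarized by the failures of its own components; sorting
    these summaries forgets which replica is which.
    """
    return tuple(sorted(tuple(fs[c] for c in replica) for replica in replicas))

def reduce_by_symmetry(all_fs, replicas):
    """Keep one representative scenario per equivalence class."""
    representatives = {}
    for fs in all_fs:
        representatives.setdefault(canonical(fs, replicas), fs)
    return list(representatives.values())

replicas = [("F1", "G1"), ("F2", "G2"), ("F3", "G3")]
components = [c for replica in replicas for c in replica]
all_fs = scenarios(components, ("OK", "byzantine"))
reduced = reduce_by_symmetry(all_fs, replicas)
```

Here 64 scenarios collapse to 20 representatives. Replicas are permuted as whole columns rather than permuting the F and G stages independently, because a fault in F1 together with a fault in G1 affects one voter input, whereas a fault in F1 together with a fault in G2 affects two.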

The practical utility of our (or any) approach to fault-tolerance analysis can be determined
only by trying it on a wide range of fault-tolerant systems. Therefore, not only does
this  thesis  introduce  an  approach,  but  it  examines  the  applicability  of  that  approach  to  some 
important  classes  of  fault-tolerant  systems.  Classic  algorithms  for  Byzantine  agreement  and 
reliable  broadcast  are  analyzed.  New  protocols  for  fault-tolerant  computation  with  moving 
agents  are  also  analyzed.  These  analyses  were  performed  using  a  prototype  tool,  described 
in  Appendix  B,  that  implements  the  analysis  and  provides  a  graphical  interface  to  it. 

1.3  Outline  of  the  Dissertation 

Chapter  2  presents  a  framework  that  incorporates  abstraction  mechanisms  but  ignores  fail¬ 
ures.  Chapter  3  extends  that  framework  for  fault-tolerance  analysis.  First,  a  non-perturbational 
framework  is  described,  in  which  each  component  is  parameterized  by  its  possible  failures, 
allowing  the  component’s  behavior  to  be  described  separately  for  each  possible  failure.  Limi¬ 
tations  of  this  approach  are  discussed,  motivating  the  introduction  of  explicit  perturbations, 
which enable expression of additional correlations between the failure-free and faulty
behaviors of a component.

Chapter  4  analyzes  algorithms  for  two  classic  problems  in  fault-tolerance:  Byzantine 
agreement [LSP82] and reliable broadcast [HT94]. These examples illustrate analysis of
systems  subject  to  Byzantine  failures  and  crash  failures,  respectively.  Chapter  5  applies  our 
framework  to  fault-tolerant  moving  agents,  a  more  recent  concern.  This  illustrates  how  our 
analysis  methods  can  be  applied  to  protocols  that  employ  cryptographic  techniques. 

Chapter  6  discusses  related  work  and  presents  several  ideas  for  future  work.  Appendix  A 
contains  an  index  of  symbols.  Appendix  B  describes  our  prototype  implementation  of  and 
graphical  interface  for  the  analysis. 


Chapter  2 


Analyzing  Systems  that  Never  Fail 


Our system model is based on Gilles Kahn’s stream-processing model of parallel and
distributed systems [Kah74]. For concreteness, we, like Kahn, consider systems of components
that  communicate  through  asynchronous  FIFO  channels.  However,  this  assumption  is  not 
essential  to  our  approach. 

The  system  model  presented  in  this  chapter  differs  from  other  descendants  of  Kahn’s 
model  mainly  by  having  mechanisms  that  allow  approximation  of  system  behavior.  The 
traditional  purpose  of  stream-processing  models  is  to  provide  compositional  semantics  for 
networks  of  communicating  processes,  thereby  serving  as  the  foundation  of  compositional 
frameworks  for  (manually)  proving  properties.  In  contrast,  our  goal  is  to  develop  a  foundation 
for  automated  analysis.  Approximations  are  necessary  to  make  this  analysis  tractable.  The 
approximations we use abstract from aspects of a system that do not directly impact
fault-tolerance.

Section  2.1  presents  a  model  of  concrete  processes.  Section  2.2  describes  abstraction 
mechanisms  used  to  approximate  a  system’s  behavior.  Section  2.3  discusses  the  abstract 
representation  of  processes  and  defines  the  fixed-point  analysis  of  system  behavior.  Section 
2.4  relates  the  two  models.  It  shows  soundness  of  the  fixed-point  analysis:  if  an  appropriate 
fixed-point  exists  at  the  abstract  level,  then  that  fixed-point  represents  all  possible  behaviors 
of  the  system  of  processes.  Finally,  Section  2.5  discusses  conditions  under  which  a  fixed-point 
is  guaranteed  to  exist  and  under  which  iterative  calculation  of  the  fixed-point  is  guaranteed 
to  terminate. 

2.1 Concrete Model

Kahn’s  original  model  [Kah74]  handles  only  determinate  systems,  systems  in  which  each 
component  is: 

Deterministic:  For  each  possible  sequence  of  inputs,  the  component  has  exactly  one 
possible  sequence  of  outputs. 

Strict: The input (“receive”) primitives are restricted so that at any time, a component
can  be  waiting  to  receive  a  message  from  at  most  one  component.  This  ensures  that 
components  are  insensitive  to  non-determinism  in  message  reception  order  caused  by 
an  asynchronous  FIFO  network. 

After  summarizing  Kahn’s  model,  we  present  an  extension  based  on  [Bro87,Bro90]  that 
eliminates  these  restrictions. 

2.1.1  Kahn’s  Model  of  Determinate  Systems 

A  system  is  a  collection  of  communicating  components.  Each  component  is  represented  by  a 
function  describing  its  input-output  behavior.  An  input-output  function  takes  as  argument 
the  sequences  of  messages  received  by  a  component  (during  some  computation)  and  returns 
the  sequences  of  messages  sent  by  that  component  as  a  result  of  receiving  those  messages. 
An  input-output  function  is  sometimes  called  a  stream-processing  function  [BD92],  stream 
transformer  [DS89],  or  history  function  [BA81],  because  it  maps  a  stream  of  input  messages 
(the  input  history)  to  a  stream  of  output  messages  (the  output  history). 

If  we  depict  the  behavior  of  a  system  by  a  graph  in  which  each  node  corresponds  to  a 
component  and  each  edge  is  labeled  with  the  sequence  of  messages  sent  from  the  source  to 
the  target,  then  an  input-output  function  for  a  component  takes  as  argument  the  sequences 
of  messages  labeling  the  inedges  of  the  node  corresponding  to  that  component  and  returns 
the  sequences  of  messages  labeling  the  outedges  of  that  node.  Figure  2.1  shows  such  a  graph. 
Note that double angle brackets ($\langle\langle \cdot \rangle\rangle$) construct sequences. Component C receives inputs
from  components  A  and  B.  Applying  the  input-output  function  associated  with  C  to  those 
input  sequences  yields  the  sequences  shown  on  the  edges  from  C  to  D  and  E.  In  this  case, 
the  input-output  function  associated  with  C  happens  to  forward  messages  from  A  to  D  and 
from  B  to  E. 

Figure 2.1: Example of graphical representation of system behavior.

These ideas are formalized as follows. A system comprises a set of named components.
The names serve as addresses for specifying the recipient of a message. Let Name denote
the names of the system components. For example, for the system pictured in Figure 2.1,
$\mathit{Name} = \{A, B, C, D, E\}$. Let CVal (mnemonic for “Concrete Values”) be the set of values
that can be transmitted in messages. A (concrete) history (of the messages sent along a
channel) has signature

\[ \mathit{CHist} \triangleq \mathit{Name} \to \mathit{Seq}(\mathit{CVal}), \tag{2.1} \]

where $\triangleq$ means “equal by definition”, and for any set S, Seq(S) is the set of finite and
infinite sequences of elements of S (with zero-based indexing). Also, for any sets S and T, a
function of signature $S \to T$ is a total function from S to T. When a history ch is regarded
as the input to a component x, ch(y) is the sequence of messages sent by y to x; when ch is
regarded as the output of a component x, ch(y) is the sequence of messages sent by x to y.
For example, in Figure 2.1, the input history of component C is

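\[ \begin{aligned}
ch_0 \triangleq (\lambda x{:}\,\mathit{Name}.\ & \text{if } x = A \text{ then } \langle\langle 27, 35 \rangle\rangle \\
& \text{else if } x = B \text{ then } \langle\langle 10, 6, 4 \rangle\rangle \\
& \text{else } \varepsilon),
\end{aligned} \tag{2.2} \]

where $\varepsilon$ is the empty sequence, and the output history of component C is

\[ \begin{aligned}
(\lambda x{:}\,\mathit{Name}.\ & \text{if } x = D \text{ then } \langle\langle 27, 35 \rangle\rangle \\
& \text{else if } x = E \text{ then } \langle\langle 10, 6, 4 \rangle\rangle \\
& \text{else } \varepsilon).
\end{aligned} \]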
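The two histories above can be written down directly in code. The following is only a minimal sketch, assuming finite sequences (Python tuples) and histories stored as dictionaries; the helper names `history` and `forward_c` are illustrative and do not come from the dissertation.

```python
from typing import Dict, Tuple

Seq = Tuple[int, ...]       # finite sequences only (an assumption of this sketch)
CHist = Dict[str, Seq]      # maps a component name to a message sequence

NAMES = ("A", "B", "C", "D", "E")

def history(**seqs: Seq) -> CHist:
    """Build a total history: names not mentioned map to the empty sequence."""
    return {name: tuple(seqs.get(name, ())) for name in NAMES}

def forward_c(inputs: CHist) -> CHist:
    """Input-output function of C in Figure 2.1: forward A to D and B to E."""
    return history(D=inputs["A"], E=inputs["B"])

# Input history ch0 of component C, as in (2.2), and the resulting output history.
ch0 = history(A=(27, 35), B=(10, 6, 4))
out_c = forward_c(ch0)
```

Representing a history as a total mapping (with the empty tuple as default) mirrors the requirement that a function of signature Name → Seq(CVal) be total.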

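Partial ordering $\leq_{\mathit{CHist}}$ on CHist is the pointwise extension of the prefix ordering $\leq_{\mathit{Seq}}$
on sequences. More explicitly,

\[ ch_1 \leq_{\mathit{CHist}} ch_2 \triangleq (\forall x \in \mathit{Name} : ch_1(x) \leq_{\mathit{Seq}} ch_2(x)). \tag{2.3} \]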
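As a small illustration of (2.3), the ordering can be checked mechanically for finite histories. This is a sketch under the assumption that sequences are finite tuples and histories are dictionaries; `is_prefix` and `hist_leq` are hypothetical helper names.

```python
def is_prefix(s, t):
    """Prefix ordering on finite sequences: s is an initial segment of t."""
    return len(s) <= len(t) and tuple(t[:len(s)]) == tuple(s)

def hist_leq(ch1, ch2, names):
    """Pointwise extension of the prefix ordering to histories, as in (2.3)."""
    return all(is_prefix(ch1.get(x, ()), ch2.get(x, ())) for x in names)

names = ("A", "B")
smaller = {"A": (27,), "B": ()}
larger = {"A": (27, 35), "B": (10, 6, 4)}
```

Here `hist_leq(smaller, larger, names)` holds because every sequence in `smaller` is a prefix of the corresponding sequence in `larger`, while the converse comparison fails.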
Each component of a system, whether a piece of hardware, a process (in the usual sense),
a moving agent, etc., is represented as a process. A determinate process is a function with
signature

\[ \mathit{DProcess} \triangleq \mathit{CHist} \twoheadrightarrow \mathit{CHist}, \tag{2.4} \]

where the two-headed arrow indicates a restriction to monotonic and continuous functions.
The argument of the function contains the input sequences, and the result contains the
corresponding output sequences. Informally, the restriction to monotonic functions ensures
that providing additional inputs to a component can’t cause the component to produce
fewer outputs (i.e., the component can’t “take back” outputs it already emitted). Continuity
ensures that a component can’t produce outputs only after receiving an infinite sequence of
inputs.

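A system is represented by a function $np \in \mathit{Name} \to \mathit{DProcess}$ (np is mnemonic for
“name → process”), which describes how each component of the system behaves. A
run of a system is represented by a function with signature

\[ \mathit{CRun} \triangleq \mathit{Name} \to \mathit{CHist}. \tag{2.5} \]

For $cr \in \mathit{CRun}$, we adopt the convention that cr(x) is the input history of component x in
the run, i.e., cr(x)(y) is the sequence of messages sent to x by y. For example, the graph in
Figure 2.1 corresponds to the run

\[ \begin{aligned}
(\lambda x{:}\,\mathit{Name}.\ & \text{if } x = C \text{ then } ch_0 \\
& \text{else if } x = D \text{ then } (\lambda y{:}\,\mathit{Name}.\ \text{if } y = C \text{ then } \langle\langle 27, 35 \rangle\rangle \text{ else } \varepsilon) \\
& \text{else if } x = E \text{ then } (\lambda y{:}\,\mathit{Name}.\ \text{if } y = C \text{ then } \langle\langle 10, 6, 4 \rangle\rangle \text{ else } \varepsilon) \\
& \text{else } (\lambda y{:}\,\mathit{Name}.\ \varepsilon)),
\end{aligned} \]

where $ch_0$ is defined by (2.2).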

The run representing the behavior of a system np of determinate processes is defined
as follows. For a run $cr \in \mathit{CRun}$, the input history of component x is cr(x). The output
history of x on those inputs is np(x)(cr(x)). Taken together, these new output histories for
each component define new input histories for each component; specifically, these new input
histories form the run

\[ \mathit{step}(np)(cr) \triangleq (\lambda y{:}\,\mathit{Name}.\ (\lambda x{:}\,\mathit{Name}.\ np(x)(cr(x))(y))). \tag{2.6} \]

For a run cr that represents a complete execution of the system, this new run must equal
the run cr that we started with, since the output histories defined by cr must already reflect
processing of all messages in the input histories defined by cr. Thus, the behavior of the
system is represented by a run cr satisfying

\[ (\forall x \in \mathit{Name} : (\forall y \in \mathit{Name} : cr(y)(x) = np(x)(cr(x))(y))); \tag{2.7} \]

more precisely, it is represented by the least such run, where the ordering $\leq_{\mathit{CRun}}$ on CRun is
the pointwise extension of the ordering $\leq_{\mathit{CHist}}$ on concrete histories [Kah74]. Equivalently,
this run can be characterized as the least fixed-point of $\mathit{step}(np) \in \mathit{CRun} \to \mathit{CRun}$.¹ Thus,
the run representing the behavior of system np is

\[ \mathit{crun}(np) \triangleq \mathit{lfp}(\mathit{step}(np)). \tag{2.8} \]

This  model  has  two  main  attractions  for  us,  besides  its  elegance.  First,  input-output 
functions  provide  abstraction.  Just  as  an  abstract  data  type  hides  internal  details  of  objects, 
an  input-output  function  hides  the  local  state  and  internal  computations  used  to  implement 
the  component,  describing  only  the  externally  visible  behavior  of  the  component.  In  our 
framework,  this  abstraction  provides  a  convenient  separation  of  local  and  global  analyses. 
Local  analysis  is  done  on  each  component  to  determine  an  input-output  function  that  rep¬ 
resents  its  behavior.  The  global  analysis,  embodied  as  a  fixed-point  calculation,  determines 
the  entire  system’s  behavior — the  possible  histories  of  messages. 

Second,  defining  a  system’s  behavior  as  a  fixed-point  facilitates  verification  and  simula¬ 
tion.  It  facilitates  verification  by  allowing  powerful  induction  rules  in  proving  properties  of 
the  fixed-point.  It  facilitates  simulation  by  providing  a  simple  and  effective  procedure  for 
computing  (finite  prefixes  of)  the  system’s  behavior. 

The  computation  of  least  fixed-points  is  based  on  the  following  definitions  and  classic 
theorem. 

Basic Fixed-Point Theory. The upper-closure of an element x of a partial order $\langle S, \leq_S \rangle$
is $\{y \in S \mid x \leq_S y\}$. Note that single angle brackets ($\langle \cdot \rangle$) construct tuples. We often omit
the ordering $\leq_S$ when it is obvious from context. A chain of a partial order $\langle S, \leq_S \rangle$ is an
increasing sequence of elements of S; thus, the set of chains of a partial order is given by

\[ \mathit{Chain}(\langle S, \leq_S \rangle) \triangleq \{\sigma \in \mathit{Seq}(S) \mid (\forall i \in (\mathit{dom}(\sigma) \setminus \{0\}) : \sigma[i-1] \leq_S \sigma[i])\}, \tag{2.9} \]

where for a sequence $\sigma$, $|\sigma|$ is the length of $\sigma$ and $\mathit{dom}(\sigma)$ is the index set of $\sigma$, i.e.,
$\mathit{dom}(\sigma) = \{i \in \mathbb{N} \mid i < |\sigma|\}$. An $\omega$-chain is a chain of length $\omega$. Let $\langle\langle g(i) \rangle\rangle_{i \in \mathbb{N}}$ denote the sequence
$\langle\langle g(0), g(1), g(2), \ldots \rangle\rangle$, where $\mathbb{N}$ denotes the natural numbers. An $\omega$-complete partial order
(abbreviated $\omega$-cpo) is a partial order $\langle S, \leq_S \rangle$ such that every $\omega$-chain $\sigma$ of $\langle S, \leq_S \rangle$ has a
least upper bound, denoted $\mathit{lub}(\sigma)$. A function $f \in S \to S$ is continuous if it is (1) monotonic
and (2) preserves least upper bounds of $\omega$-chains, i.e., for all $\omega$-chains $\sigma$,
$f(\mathit{lub}(\sigma)) = \mathit{lub}(\langle\langle f(\sigma[i]) \rangle\rangle_{i \in \mathbb{N}})$. Among the most celebrated and useful properties of
$\omega$-cpos is [Gun92, chapter 4]:

Theorem 2.1. Let $\langle S, \leq_S \rangle$ be an $\omega$-cpo. Let f be a continuous function in $S \to S$, and let
$x \in S$ be such that $x \leq_S f(x)$. Then f has a fixed-point in the upper-closure of x in S.
Furthermore, the least such fixed-point is $\mathit{lub}(\langle\langle f^i(x) \rangle\rangle_{i \in \mathbb{N}})$.

Note that $f^i(x)$ denotes f applied i times to x.

¹ As discussed below, existence of this fixed-point follows from Theorem 2.1.
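Theorem 2.1 suggests a simple iterative procedure: compute x, f(x), f(f(x)), and so on, and stop once the value no longer changes. The sketch below assumes a domain in which the chain stabilizes after finitely many steps (in general the least fixed-point is only the limit of the chain); the function `kleene_lfp` and the small graph `EDGES` are hypothetical illustrations, with sets ordered by inclusion standing in for the partial order.

```python
def kleene_lfp(f, x, max_steps=1000):
    """Iterate x, f(x), f(f(x)), ... and return the first value with f(v) == v."""
    for _ in range(max_steps):
        nxt = f(x)
        if nxt == x:
            return x
        x = nxt
    raise RuntimeError("chain did not stabilize within max_steps")

# Hypothetical example: nodes reachable from node 0, with sets ordered by inclusion.
EDGES = {0: {1}, 1: {2}, 2: {1}, 3: {0}}

def reach_step(seen):
    """Monotonic function on sets of nodes: add node 0 and all successors."""
    return frozenset({0}) | seen | frozenset(m for n in seen for m in EDGES[n])

reachable = kleene_lfp(reach_step, frozenset())
```

The starting point `frozenset()` satisfies the hypothesis x ≤ f(x) of the theorem, since the empty set is included in every set.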

Computing Runs. It is easy to check that CRun is an $\omega$-cpo with least element

\[ \bot_{\mathit{CRun}} \triangleq (\lambda x{:}\,\mathit{Name}.\ \bot_{\mathit{CHist}}), \tag{2.10} \]

where the least element $\bot_{\mathit{CHist}}$ of CHist is

\[ \bot_{\mathit{CHist}} \triangleq (\lambda x{:}\,\mathit{Name}.\ \varepsilon). \tag{2.11} \]

Recall that determinate processes are, by definition (2.4), monotonic and continuous. It is
easy to check that this restriction on determinate processes implies that for all
$np \in \mathit{Name} \to \mathit{DProcess}$, step(np) is monotonic and continuous. It follows from Theorem 2.1 that for all
$np \in \mathit{Name} \to \mathit{DProcess}$, step(np) has a least fixed-point in CRun, given by

\[ \mathit{crun}(np) = \mathit{lub}(\langle\langle \mathit{step}(np)^i(\bot_{\mathit{CRun}}) \rangle\rangle_{i \in \mathbb{N}}). \]

Thus, prefixes of the system’s behavior can be calculated by starting with $\bot_{\mathit{CRun}}$ and
repeatedly applying step(np).
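The calculation just described can be sketched for the system of Figure 2.1. This is an illustrative sketch only: it assumes finite message sequences, and it assumes that A and B are sources that emit the sequences shown in the figure without reading any input (the text does not define their processes). All Python names are hypothetical.

```python
NAMES = ("A", "B", "C", "D", "E")
EMPTY = ()

def const_out(dest, msgs):
    """Determinate process that ignores its input and sends msgs to dest."""
    return lambda h: {y: (msgs if y == dest else EMPTY) for y in NAMES}

def forward_c(h):
    """Process C: forward messages from A to D and from B to E."""
    return {y: h["A"] if y == "D" else h["B"] if y == "E" else EMPTY
            for y in NAMES}

def sink(h):
    """Process that produces no output."""
    return {y: EMPTY for y in NAMES}

np_fig = {"A": const_out("C", (27, 35)), "B": const_out("C", (10, 6, 4)),
          "C": forward_c, "D": sink, "E": sink}

def step(np_, cr):
    """One application of (2.6): new input histories from current outputs."""
    outputs = {x: np_[x](cr[x]) for x in NAMES}          # np(x)(cr(x))
    return {y: {x: outputs[x][y] for x in NAMES} for y in NAMES}

def crun(np_, max_steps=100):
    """Iterate step from the least run until it stabilizes (finite case only)."""
    cr = {y: {x: EMPTY for x in NAMES} for y in NAMES}   # least element of CRun
    for _ in range(max_steps):
        nxt = step(np_, cr)
        if nxt == cr:
            return cr
        cr = nxt
    raise RuntimeError("no fixed point reached within max_steps")

run = crun(np_fig)
```

Following the convention for runs, `run[x][y]` holds the messages sent to x by y, so `run["D"]["C"]` is the sequence on the edge from C to D.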

2.1.2  Concrete  Model  of  Non-Determinate  Systems 

Recall  that  Kahn’s  model  deals  only  with  determinate  systems,  i.e.,  systems  whose  compo¬ 
nents  are  deterministic  and  strict.  Restriction  to  determinate  systems  is  a  severe  limitation, 
and  approaches  to  eliminating  this  restriction  have  been  suggested  [BA81,BM82,SN85,Bro87, 
Jon89,Bro90,Rus90].  Here,  we  will  adopt  a  slight  variant  of  an  approach  due  to  Broy  [Bro87, 
Bro90].  Two  ideas  are  involved:  one  to  handle  non-determinism,  and  a  second  to  handle 
non-strictness.  The  idea  for  handling  non-determinism  is  to  represent  a  non-deterministic 
process  as  a  set  of  determinate  processes:  each  determinate  process  in  the  set  corresponds 
to one possible behavior of the non-deterministic process. To see why this first idea alone
is insufficient to model non-strictness, consider a non-strict process first that waits for an
input from process $x_1$ or $x_2$, then echoes that input on its output. If we try to represent
this process as a set of determinate processes, the obvious candidate is $\{dp_1, dp_2\}$, where $dp_i$
waits for an input from process $x_i$. However, this is not equivalent to first, because if the
only input comes from (say) $x_1$, then $dp_2$ blocks forever, producing no output, so $\{dp_1, dp_2\}$
might produce no output on this input; in contrast, first definitely produces an output. One
solution is to indicate that $dp_2$ represents the behavior of first only when the input contains
a message from $x_2$. More generally, we associate with each determinate process dp in the set
an input-restriction, which is the set of inputs for which dp represents a possible behavior of
the component [Bro87,Bro90]. Thus, (non-determinate) processes are in

\[ \mathit{Process} \triangleq \mathit{Set}(\mathit{IRProcess}), \tag{2.12} \]

where input-restricted processes are in

\[ \mathit{IRProcess} \triangleq \mathit{DProcess} \times \mathit{Set}(\mathit{CHist}). \tag{2.13} \]


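To summarize, for a process p, for each $\langle dp, ir \rangle \in p$, p can behave like the
determinate process dp, but only on inputs in ir. Thus, the set of possible runs of a system
$np \in \mathit{Name} \to \mathit{Process}$ of non-determinate processes is

\[ \begin{aligned}
\mathit{cruns}(np) \triangleq \{cr \in \mathit{CRun} \mid{} & (\exists h \in \mathit{Name} \to \mathit{IRProcess} : \\
& \quad \land\ (\forall x \in \mathit{Name} : h(x) \in np(x)) \\
& \quad \land\ (\forall x \in \mathit{Name} : \mathit{enabled}(h(x), \langle\langle \mathit{step}(\pi_1 \circ h)^i(\bot_{\mathit{CRun}})(x) \rangle\rangle_{i \in \mathbb{N}})) \\
& \quad \land\ cr = \mathit{lfp}(\mathit{step}(\pi_1 \circ h)))\},
\end{aligned} \tag{2.14} \]

where $\pi_i$ projects the ith component of a tuple, $\circ$ denotes function composition, and an
input-restricted process $p \in \mathit{IRProcess}$ is enabled for a chain $\sigma \in \mathit{Chain}(\mathit{CHist})$ of inputs iff

\[ \mathit{enabled}(p, \sigma) \triangleq (\sigma \cup \{\mathit{lub}(\sigma)\}) \subseteq \pi_2(p). \tag{2.15} \]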


For  notational  convenience,  we  sometimes  (as  here)  regard  a  sequence  as  the  set  of  its 
elements. 

The conjunction in (2.14) is written using Lamport’s bullet-style notation [Lam93]. In
this notation, the conjunction $b_1 \land b_2 \land \cdots \land b_n$ is written

\[ \begin{array}{l} \land\ b_1 \\ \land\ b_2 \\ \quad \vdots \\ \land\ b_n \end{array} \]

Bullet-style notation can be used for disjunction ($\lor$) as well.

2.1.3  A  Running  Example 

To  illustrate  this  model,  we  return  to  the  two-stage  replicated  pipeline  of  Figure  1.1.  Here, 
we  consider  a  failure-free  version  of  that  system;  failures  will  be  considered  in  Chapter  3. 
The two stages compute the functions $F \in \mathbb{N} \to \mathbb{N}$ and $G \in \mathbb{N} \to \mathbb{N}$, respectively.² The system
contains a source S, which sends a number i to three processors $F_1$, $F_2$, and $F_3$, which each
compute F(i) then send the result to the next stage in the pipeline. Processors $G_1$, $G_2$, and
$G_3$ in the next stage compute G of their input then send the results to a 3-input voter V. For
convenience, we assume that processors $F_1$–$F_3$ and $G_1$–$G_3$ are well-behaved on unexpected
inputs; specifically, that each input not in $\mathbb{N}$ is treated as if it were the input “0”. Voter V
waits for an input from each $G_i$, computes the majority of those inputs, and then sends the
result to an actuator A. More precisely, the voter computes a 3-way majority function maj,
which is any function in $\mathbb{N}^3 \to \mathbb{N}$ such that if any two of its three arguments have the same
value i, then the value of the majority function on those arguments is i.
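One function meeting this specification can be sketched as follows. The definition deliberately leaves the result unconstrained when all three arguments differ, so the fallback chosen here (returning the first argument) is an arbitrary assumption of the sketch, not something prescribed by the text.

```python
def maj(a, b, c):
    """3-way majority: if two arguments agree on a value, return that value."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return a  # no two arguments agree: any result is permitted; this choice is arbitrary

examples = [maj(7, 7, 9), maj(7, 9, 7), maj(9, 7, 7), maj(4, 4, 4)]
```

Any other choice in the final branch would give a different but equally valid majority function.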

Source  S  may  contain  physical  sensors  (e.g.,  a  keyboard  or  an  air-speed  indicator). 
Therefore, we model the source as a non-deterministic process $\mathit{CSrc}(\{F_1, F_2, F_3\})$. For
$\mathit{dests} \in \mathit{Set}(\mathit{Name})$, CSrc(dests) is a process that non-deterministically selects a natural
number and then sends it to each component named in dests. Non-deterministic process
CSrc(dests) is defined in terms of a family of determinate processes: $\mathit{CSrc}_i(\mathit{dests})$ is a
determinate process that sends the number i to the components named in dests. Note that
CSrc(dests) is trivially strict, since it ignores input messages; thus, the input-restriction
associated with each $\mathit{CSrc}_i(\mathit{dests})$ is CHist, which is no restriction at all.

\[ \mathit{CSrc}(\mathit{dests}) \triangleq \bigcup_{i \in \mathbb{N}} \{\langle \mathit{CSrc}_i(\mathit{dests}), \mathit{CHist} \rangle\} \tag{2.16} \]

\[ \mathit{CSrc}_i(\mathit{dests}) \triangleq (\lambda h{:}\,\mathit{CHist}.\ (\lambda x{:}\,\mathit{Name}.\ \text{if } x \in \mathit{dests} \text{ then } \langle\langle i \rangle\rangle \text{ else } \varepsilon)) \tag{2.17} \]

² F is overloaded: it is both a symbol in our language of values (defined in Section 2.2.2) and the name
of the mathematical function represented by that symbol. Similarly for G and for maj below.
Processes for $F_1$–$F_3$ and $G_1$–$G_3$ are defined as instances of $\mathit{CComp}(\mathit{src}, \mathit{dest}, \phi)$, which
is a process that receives values from src, computes a function $\phi$ on those values, and sends
the results to dest:

\[ \mathit{CComp}(\mathit{src}, \mathit{dest}, \phi) \triangleq \{\langle \mathit{CComp}'(\mathit{src}, \mathit{dest}, \phi), \mathit{CHist} \rangle\} \tag{2.18} \]

\[ \mathit{CComp}'(\mathit{src}, \mathit{dest}, \phi) \triangleq (\lambda h{:}\,\mathit{CHist}.\ (\lambda x{:}\,\mathit{Name}.\ \text{if } x = \mathit{dest} \text{ then } \hat{\phi}(h(\mathit{src})) \text{ else } \varepsilon)), \tag{2.19} \]

where $\hat{\phi}$ is a pointwise extension of $\phi$ to sequences of concrete values, with values not in $\mathbb{N}$
treated as zero. For example, the behavior of $F_1$ is described by the process $\mathit{CComp}(S, G_1, F)$.

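The voter V in this system is a process $\mathit{CVoter}(G_1, G_2, G_3, A)$, where for any $s_1, s_2, s_3, \mathit{dest} \in \mathit{Name}$,
$\mathit{CVoter}(s_1, s_2, s_3, \mathit{dest})$ waits (i.e., produces no output) until it has received an input
from each of $s_1$, $s_2$, and $s_3$, computes a majority using the first input from each of $s_1$, $s_2$,
and $s_3$, and then sends the result to dest.³

\[ \mathit{CVoter}(s_1, s_2, s_3, \mathit{dest}) \triangleq \{\langle \mathit{CVoter}'(s_1, s_2, s_3, \mathit{dest}), \mathit{CHist} \rangle\} \tag{2.20} \]

\[ \begin{aligned}
\mathit{CVoter}'(s_1, s_2, s_3, \mathit{dest}) \triangleq (\lambda h{:}\,\mathit{CHist}.\ (\lambda x{:}\,\mathit{Name}.\ & \text{if } x = \mathit{dest} \text{ then} \\
& \quad \text{if } h(s_1) = \varepsilon \lor h(s_2) = \varepsilon \lor h(s_3) = \varepsilon \text{ then } \varepsilon \\
& \quad \text{else } \langle\langle \mathit{maj}(h(s_1)[0], h(s_2)[0], h(s_3)[0]) \rangle\rangle \\
& \text{else } \varepsilon))
\end{aligned} \tag{2.21} \]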

It  is  easy  to  generalize  this  definition  to  voters  that  have  an  arbitrary  number  of  inputs. 

The  actuator  is  the  process  CAct,  which  is  just  a  message  sink: 

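\[ \mathit{CAct} \triangleq \{\langle \mathit{CAct}', \mathit{CHist} \rangle\} \tag{2.22} \]

\[ \mathit{CAct}' \triangleq (\lambda h{:}\,\mathit{CHist}.\ (\lambda x{:}\,\mathit{Name}.\ \varepsilon)) \tag{2.23} \]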

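For this system, $\mathit{Name} = \{S, F_1, F_2, F_3, G_1, G_2, G_3, V, A\}$. Let $np_{re}$ (re is mnemonic
for “running example”) be the obvious mapping from names to processes: $np_{re}(S) =
\mathit{CSrc}(\{F_1, F_2, F_3\})$, etc. It is easy to check that

\[ \mathit{cruns}(np_{re}) = \bigcup_{i \in \mathbb{N}} \{cr_{re}(i)\} \tag{2.24} \]

\[ \begin{aligned}
cr_{re}(i) \triangleq (\lambda y{:}\,\mathit{Name}.\ (\lambda x{:}\,\mathit{Name}.\ & \text{if } x = S \land y \in \{F_1, F_2, F_3\} \text{ then } \langle\langle i \rangle\rangle \\
& \text{else if } \langle x, y \rangle \in \{\langle F_1, G_1 \rangle, \langle F_2, G_2 \rangle, \langle F_3, G_3 \rangle\} \text{ then } \langle\langle F(i) \rangle\rangle \\
& \text{else if } x \in \{G_1, G_2, G_3\} \land y = V \text{ then } \langle\langle G(F(i)) \rangle\rangle \\
& \text{else if } x = V \land y = A \text{ then } \langle\langle G(F(i)) \rangle\rangle \\
& \text{else } \varepsilon)).
\end{aligned} \tag{2.25} \]

³ This defines a strict voter. Alternatively, we could have defined a non-strict voter, which produces an
output immediately after receiving two equal inputs.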

Each  run  crre(i)  can  be  represented  by  a  graph  like  the  one  in  Figure  2.1  (page  11).  To  obtain 
a  finite  graphical  representation  of  the  infinite  family  of  runs  given  by  (2.25),  abstractions 
(such  as  those  described  in  the  next  section)  are  employed. 
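For a single choice of the source value, one member of this family of runs can be computed by iterating the step function of (2.6). The sketch below is illustrative only: it fixes one determinate source process per call (so it computes one run, not the whole set of runs), assumes finite sequences, and uses hypothetical stand-ins `F(n) = n + 1` and `G(n) = 2n` for the stage functions, which the text leaves unspecified.

```python
NAMES = ("S", "F1", "F2", "F3", "G1", "G2", "G3", "V", "A")
EMPTY = ()

def F(n): return n + 1          # hypothetical stage-1 function
def G(n): return 2 * n          # hypothetical stage-2 function

def maj(a, b, c):
    """Majority of three values; arbitrary (b) when all three differ."""
    return a if a in (b, c) else b

def nat(v):
    """Treat any value that is not a natural number as 0."""
    return v if isinstance(v, int) and v >= 0 else 0

def out(**sends):
    """Total output history: unmentioned destinations get the empty sequence."""
    return {y: tuple(sends.get(y, EMPTY)) for y in NAMES}

def csrc_i(i, dests):
    return lambda h: out(**{d: (i,) for d in dests})

def ccomp(src, dest, phi):
    return lambda h: out(**{dest: tuple(phi(nat(v)) for v in h[src])})

def cvoter(s1, s2, s3, dest):
    def dp(h):
        if not (h[s1] and h[s2] and h[s3]):
            return out()                      # still waiting for some input
        return out(**{dest: (maj(h[s1][0], h[s2][0], h[s3][0]),)})
    return dp

def cact(h):
    return out()                              # message sink

def np_re(i):
    """The running example with the source fixed to send i."""
    procs = {"S": csrc_i(i, ("F1", "F2", "F3")),
             "V": cvoter("G1", "G2", "G3", "A"), "A": cact}
    for k in "123":
        procs["F" + k] = ccomp("S", "G" + k, F)
        procs["G" + k] = ccomp("F" + k, "V", G)
    return procs

def step(np_, cr):
    outputs = {x: np_[x](cr[x]) for x in NAMES}
    return {y: {x: outputs[x][y] for x in NAMES} for y in NAMES}

def crun(np_, max_steps=100):
    cr = {y: {x: EMPTY for x in NAMES} for y in NAMES}
    for _ in range(max_steps):
        nxt = step(np_, cr)
        if nxt == cr:
            return cr
        cr = nxt
    raise RuntimeError("no fixed point reached within max_steps")

run3 = crun(np_re(3))
```

With these stand-in functions, the run for i = 3 carries 3 on the edges out of S, F(3) = 4 between the two stages, and G(F(3)) = 8 into and out of the voter, matching the shape of (2.25).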

2.2  Representation  of  Runs 

Computing  the  exact  set  of  runs  for  a  system  is,  in  general,  infeasible.  Thus,  the  model 
described  in  Section  2.1 — hereafter  called  the  concrete  model — is  inadequate  for  automated 
analysis  of  systems.  This  inadequacy  motivates  the  development  of  a  system  model  that 
incorporates  approximations.  To  distinguish  our  new  model  from  the  concrete  one  we  already 
discussed,  we  sometimes  refer  to  the  new  model  as  the  abstract  model. 

In  the  abstract  model,  a  system’s  behavior  is  approximated  by: 

Values:  The  data  sent  in  messages. 

Multiplicities:  The  number  of  times  each  value  is  sent. 

Orderings:  The  order  in  which  values  are  sent. 

This  section,  then,  describes  how  each  of  these  three  kinds  of  information  is  represented  in 
our  framework. 

Recall  that  in  the  concrete  model,  runs  are  characterized  by  the  sequences  of  messages 
sent  between  each  pair  of  components.  Thus,  our  abstract  model  is  based  on  a  language  for 
approximating  sequences  of  messages.  Specifically,  we  approximate  sequences  of  messages 
by  partially-ordered  sets  of  ms-atoms:  each  ms-atom  approximates  a  set  of  messages,  and 
the  partial  order  approximates  the  order  in  which  those  messages  appear  in  the  sequences. 
Note  that  this  order  corresponds  to  the  order  in  which  those  messages  are  sent  and,  since 
channels  are  assumed  to  be  FIFO,  also  the  order  in  which  they  are  received.  Let  L  denote 
the  set  of  ms-atoms.  A  particular  definition  of  L  appears  below;  for  now,  it  suffices  to  say 
that each element of L approximates a set of messages. A strict⁴ partial order on a set S is
an element of

\[ \mathit{Order}(S) \triangleq \{\prec\ \subseteq S \times S \mid\ \prec \text{ is acyclic and transitive}\}. \tag{2.26} \]

Note that $\in$ has lower precedence than all set constructors, such as $\times$. A strictly-partially-
ordered set (poset for short) over a set T is just a subset of T together with a partial order
on that subset:

\[ \mathit{POSet}(T) \triangleq \{\langle S, \prec \rangle \in \mathit{Set}(T) \times \mathit{Set}(T \times T) \mid\ \prec\ \in \mathit{Order}(S)\}. \tag{2.27} \]
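For finite sets, membership in Order(S) can be checked directly. The following is a minimal sketch assuming the relation is given as a set of pairs; the name `is_strict_partial_order` is illustrative. It relies on the fact that a transitive relation is acyclic exactly when it is irreflexive.

```python
from itertools import product

def is_strict_partial_order(elems, rel):
    """Check that rel is a transitive, acyclic relation on the set elems."""
    rel = set(rel)
    if not all(a in elems and b in elems for a, b in rel):
        return False
    transitive = all((a, d) in rel
                     for (a, b), (c, d) in product(rel, rel) if b == c)
    irreflexive = all(a != b for a, b in rel)   # with transitivity: acyclic
    return transitive and irreflexive

chain_order = {(1, 2), (2, 3), (1, 3)}
```

For example, `chain_order` is a strict partial order on {1, 2, 3}, whereas dropping the pair (1, 3) breaks transitivity.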

Informally, a poset $\langle S, \prec \rangle \in \mathit{POSet}(L)$ approximates a sequence $\sigma$ of messages if there exists
a correspondence between elements of S and elements of $\sigma$ such that: (1) each ms-atom
in S approximates the set of corresponding messages, and (2) if $\ell_1 \prec \ell_2$, then all messages
corresponding to $\ell_1$ precede all messages corresponding to $\ell_2$ (this definition is formalized in
Section 2.4.1).

At  the  concrete  level,  sequences  of  messages  are  aggregated  into  concrete  histories  CHist , 
defined  in  (2.1),  to  represent  all  the  inputs  or  outputs  of  a  component.  Analogously,  at  the 
abstract  level,  we  aggregate  posets  of  ms-atoms  into  histories: 

\[ \mathit{Hist} \triangleq \mathit{Name} \to \mathit{POSet}(L). \tag{2.28} \]

Histories are interpreted using the same conventions as concrete histories: when a history
$h \in \mathit{Hist}$ is regarded as the input to a component x, h(y) is the sequence of messages sent by
y to x; when h is regarded as the output of a component x, h(y) is the sequence of messages
sent by x to y.

Just  as  concrete  histories  are  aggregated  into  concrete  runs  to  represent  the  overall  be¬ 
havior  of  a  system,  in  the  abstract  model,  we  aggregate  histories  into  runs: 

\[ \mathit{Run} \triangleq \mathit{Name} \to \mathit{Hist}. \tag{2.29} \]

As in the concrete model, we adopt the convention that, for a run $r \in \mathit{Run}$ and a component
x, history r(x) represents the inputs to x. Informally, the meanings of histories and runs are
pointwise extensions of the meaning of posets of ms-atoms. Note that a run $r \in \mathit{Run}$ can be
interpreted as a labeled directed graph with set of nodes Name and with edge $\langle x, y \rangle$ labeled
with r(y)(x). Since an (abstract) run can approximate many concrete runs, it suffices to
use a single abstract run to approximate the behavior of a system. Thus, the result of our
analysis is a single run that approximates (or represents) all concrete runs of a system.

⁴ For partial orders, “strict” means “not reflexive”. This is unrelated to strictness of processes.

The remainder of this section defines the set L of ms-atoms, illustrates the definitions
with  an  example,  and  discusses  our  approximation  of  message  orderings. 

2.2.1  ms-atoms 

A  ms-atom  has  signature 

\[ L \triangleq \mathit{Mul} \times \mathit{Val} \times \mathit{Tag}. \tag{2.30} \]

The  multiplicity  Mul  indicates  how  many  messages  are  represented  by  the  ms-atom;  the  value 
Val  describes  the  data  in  those  messages.  For  example,  by  analogy  with  regular  expressions, 
we  use  a  multiplicity  of  ?  to  denote  zero  or  one  messages,  and  a  multiplicity  of  *  to  denote 
an  arbitrary  number  of  messages.  For  values,  we  use,  for  example,  a  value  of  N  to  indicate 
that  the  data  is  a  natural  number.  These  are  all  examples  of  abstract  values:  each  specifies 
a set of possible concrete values. Symbolic values (expressions that represent a single but
sometimes undetermined concrete value) are also used in Val and Mul. The sets Val and
Mul  and  our  treatment  of  message  orderings  are  discussed  in  the  following  subsections.  The 
tag  is  a  technicality.  It  allows  multiple  ms-atoms  with  the  same  value  and  multiplicity  to 
appear  on  an  edge;  an  alternative  is  to  use  bags  (i.e.,  multisets)  of  ms-atoms  with  signature 
Mul  x  Val. 
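To make the shape of an ms-atom concrete, the sketch below models only a simplified fragment: it assumes just three multiplicities ("1" for exactly one message, "?" and "*" as described above) and a single abstract value "N". These restrictions, and the names `MsAtom`, `count_ok`, and `approximates`, are assumptions of the sketch; the actual sets Mul and Val are defined in the following subsections.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MsAtom:
    mul: str   # "1", "?" or "*" (a simplified multiplicity set, assumed here)
    val: str   # only the abstract value "N" is modelled in this sketch
    tag: int   # distinguishes otherwise identical ms-atoms on one edge

def count_ok(mul, n):
    """Does a multiplicity allow exactly n messages?"""
    return {"1": n == 1, "?": n <= 1, "*": True}[mul]

def approximates(atom, messages):
    """Does the ms-atom approximate this finite collection of messages?"""
    if atom.val != "N":
        raise ValueError("only the abstract value N is modelled")
    values_ok = all(isinstance(m, int) and m >= 0 for m in messages)
    return values_ok and count_ok(atom.mul, len(messages))

optional_nat = MsAtom("?", "N", 0)
any_nats = MsAtom("*", "N", 1)
```

The tag plays no role in `approximates`; it only keeps two atoms with equal multiplicity and value distinct when they are stored in a set.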

2.2.2  Values 

Values  are  approximated  by  a  set  of  possibilities.  Each  of  these  possibilities  is  determined 
by  a  symbolic  value  and  an  abstract  value,  rather  than  (say)  a  single  concrete  value.  Let 
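AVal and SVal denote the sets of abstract and symbolic values, respectively. Values have
signature

\[ \mathit{Val} \triangleq \mathcal{P}_{\mathit{fin}}(\mathit{SVal} \times \mathit{AVal}) \setminus \{\emptyset\}, \tag{2.31} \]

where $\mathcal{P}_{\mathit{fin}}(S)$ is the set of finite subsets of S, and $\setminus$ denotes set difference.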

Abstract  Values 

An  abstract  value  characterizes  the  concrete  values  that  might  be  transmitted  in  particular 
messages.  This  idea  is  familiar  from  abstract  interpretation  [AH87]:  each  element  of  an 
abstract  interpretation  represents  a  set  of  concrete  values.  For  example,  one  way  to  abstract 
the  real  numbers  is  to  keep  track  only  of  the  sign.  The  corresponding  abstract  interpretation 
is $\{0, \mathbb{R}_-, \mathbb{R}_+, \mathbb{R}\}$, where 0 represents only itself, $\mathbb{R}_-$ represents the non-positive real
numbers, $\mathbb{R}_+$ represents the non-negative real numbers, and $\mathbb{R}$ represents all real numbers. The
abstract value $\mathbb{R}$ is included, even though it does not determine the sign, in case the sign of
a value is not known.
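The sign abstraction can be sketched in a few lines. The sketch assumes abstract values are encoded as the strings "0", "R-", "R+", and "R"; the function names `represents` and `abstract` are hypothetical and are not taken from the dissertation.

```python
def represents(a, r):
    """Is the concrete real r in the set represented by abstract value a?"""
    return {"0": r == 0, "R-": r <= 0, "R+": r >= 0, "R": True}[a]

def abstract(values):
    """Most precise abstract value that represents every given concrete value."""
    for a in ("0", "R-", "R+", "R"):      # tried from most to least precise
        if all(represents(a, v) for v in values):
            return a

signs = [abstract([0]), abstract([-1.5, 0]), abstract([0, 2.5]), abstract([-1, 1])]
```

The last case shows why the abstract value for all reals is needed: a collection containing both a negative and a positive number has no more precise description in this abstraction.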

For  generality,  we  do  not  fix  particular  abstract  values.  The  abstract  framework  is 
parameterized by the set AVal of abstract values and their meaning, given by
$[\![\cdot]\!]_{\mathit{AVal}} \in \mathit{InterpSet}(\mathit{AVal})$, where for a set S,

\[ \mathit{InterpSet}(S) \triangleq S \to \mathit{Set}(\mathit{CVal}). \tag{2.32} \]


Symbolic  Values 

Symbolic  values  track  relationships  between  values.  To  motivate  the  need  for  this,  consider 
the  two-stage  replicated  pipeline  described  in  Section  2.1.3  and  depicted  in  Figure  1.1.  The 
outputs  of  voter  V  depend  on  equality  relationships  among  its  inputs.  Abstract  values 
can  provide  some  information  about  these  relationships.  However,  if  two  inputs  both  have 
abstract  value  N,  there  is  no  way  to  tell  (from  this  fact  alone)  whether  they  are  equal. 

Additional  relationships  can  be  determined  using  symbolic  values,  which  are  expressions 
composed of constants Con and variables Var. We assume the sets Con and Var are disjoint
and do not contain the underscore symbol. Informally, a constant represents the same
value in every run of the system, while a variable represents values that may be different
in different concrete runs of the system.⁵ Symbolic values are obtained using a form of
symbolic  computation,  which  we  will  integrate  into  the  input-output  functions  and  hence 
into  the  fixed-point  analysis.  To  continue  the  pipeline  example,  note  that  if  multiple  inputs 
to  the  voter  are  represented  by  the  same  symbolic  value,  then  they  are  equal. 

Variables. Each variable is associated with a single component; we say the variable is
local to that component. Informally, the value of that variable is determined by the behavior
of that component. Making each variable local to a single component avoids name clashes
that would otherwise cause trouble when components are assembled to form a system. For
example, suppose one component nondeterministically selects and outputs a real number,
representing it by the value $\{\langle X, R\rangle\}$. Note that this value contains a single possibility
$\langle X, R\rangle$, where X is a variable and R is an abstract value. In isolation, this representation
is fine. But suppose another component in the system also nondeterministically selects and
outputs a real number, representing it by the same value $\{\langle X, R\rangle\}$. Since the two components
may choose different values, there might not be a single interpretation of X that “matches”
the overall behavior of the system (i.e., that matches the output of both components). This
problem is avoided if each component uses a local variable to represent its output.

⁵ This is made precise by the semantics given in Section 2.4.1.

Let Var(x) denote the set of variables local to component x. We require that Var(x) and
Var(y) be disjoint for $x \neq y$. The set Var is

\[ Var = \bigcup_{x \in Name} Var(x). \tag{2.33} \]

For  example,  in  Figure  1.1,  variable  X  is  local  to  the  source  S. 

Constants. The value of a constant is fixed not by the behavior of one component but by
an interpretation $\rho_C \in Interp(Con)$, where for any set S,

\[ Interp(S) = S \to CVal. \tag{2.34} \]

For example, a constant min might be interpreted as the function that returns the minimum
of a set of numbers; a constant encrypt might be interpreted as DES encryption. In
Figure 1.1, constants F and G represent the functions computed by the first and second stage,
respectively, of the pipeline.

Syntax  of  Symbolic  Values.  Symbolic  values  are  expressions  built  from  constants  and 
variables: 

\[ SVal_0 = Sym \cup \{ s(s_1, \ldots, s_n) \mid s \in Sym \wedge s_1 \in SVal_0 \wedge \cdots \wedge s_n \in SVal_0 \}, \tag{2.35} \]

where the set Sym of symbols is

\[ Sym = Con \cup Var. \tag{2.36} \]

For example, if the constant min is interpreted as above, then the symbolic value min(X, Y, Z)
represents the minimum of the values represented by X, Y, and Z.
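As a rough illustration of the syntax of symbolic values, the following Python sketch encodes a symbol as a string and an application s(s1, ..., sn) as a tuple whose first element is the head symbol. The particular sets CON and VAR and the function name in_sval0 are assumptions made for this example only.

```python
# Example symbol sets (assumed for illustration); Con and Var must be disjoint
# and must not contain the underscore.
CON = {"min", "F", "G"}
VAR = {"X", "Y", "Z"}
SYM = CON | VAR


def in_sval0(s):
    """True iff s is a symbol, or an application (head, arg1, ..., argn)
    of a symbol to one or more members of SVal0."""
    if isinstance(s, str):
        return s in SYM
    if isinstance(s, tuple) and len(s) >= 2:
        head, *args = s
        return isinstance(head, str) and head in SYM and all(in_sval0(a) for a in args)
    return False
```

With this encoding, min(X, Y, Z) is written ("min", "X", "Y", "Z"), and nested applications such as F(G(X)) are nested tuples.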

The set of symbolic values is obtained from $SVal_0$ by adding the underscore symbol,
which is used as a wildcard:⁶

\[ SVal = SVal_0 \cup \{\_\} \tag{2.37} \]

As in pattern-matching in the programming language ML [MTH90], the wildcard can
represent any value. This is especially significant when the wildcard appears in ms-atoms
that may represent multiple messages. Without the wildcard, a ms-atom containing a value
expressed using n elements of $SVal \times AVal$ could represent only sets of messages containing at
most n distinct concrete values; if the value contains a wildcard, the ms-atom could represent
a set of messages containing arbitrarily many distinct concrete values.

⁶ We could allow the wildcard to appear within larger symbolic values, but this slightly complicates the
semantics, and it’s not clear whether it is useful.

Notational Conventions. Since abstract values are analogous to types, we sometimes
write s:a to denote $\langle s, a\rangle \in SVal \times AVal$. We often elide the braces around singleton sets;
for example, we might write $X : R_+$ to denote $\{\langle X, R_+\rangle\} \in Val$. We sometimes elide the
wildcard; for example, we might write $R_+$ to denote $\_ : R_+$. These conventions are used in
Figure 1.1; for example, the value in the new ms-atom on edge $\langle F_1, S\rangle$ is actually $\{\langle \_, \top_V\rangle\}$.

Running Example Revisited. The concrete runs (2.25) of the system introduced in
Section 2.1.3 are represented by Figure 2.2. In this example, $X \in Var(S)$, $\{F, G\} \subseteq Con$,
and $[\![N]\!]_{AVal} = \mathbb{N}$. When a run is represented using a graph, edges labeled with
$\langle\emptyset, \emptyset\rangle$ are elided. Also, if the poset labeling an edge is a singleton (that is, it contains
only a single ms-atom $\ell$), then we display the poset simply as the ms-atom $\ell$. For example,
the poset on edge $\langle S, F_1\rangle$ in Figure 2.2 is, when written more explicitly,
$\langle\{\langle 1, X{:}N, 0\rangle\}, \emptyset\rangle$. This is a singleton poset (i.e., it is a pair whose first component
is a singleton set, and whose second component is the empty partial ordering), so it can be
displayed as the ms-atom $\langle 1, X{:}N, 0\rangle$. Since the multiplicity in this ms-atom is 1, and since
the tag is 0, according to the notational conventions described in Section 2.2.3, this ms-atom
can be written $X : N$, as it appears in the figure.

Figure 2.2: Run for running example.

Note  that  the  voter’s  inputs  in  Figure  2.2  are  represented  by  the  same  symbolic  value. 
Thus,  the  voter’s  inputs  are  equal,  and  the  voter’s  output  is  also  represented  by  that  symbolic 
value. Section 2.3 describes how this run is obtained. Section 2.4.1 defines the notion of a
run  representing  a  concrete  run;  informally,  it  means  that  for  each  edge,  the  messages  in  the 
concrete  run  can  be  obtained  from  the  poset  of  ms-atoms  by  duplicating  each  ms-atom  some 
number  of  times  consistent  with  its  multiplicity,  linearizing  the  copies  consistently  with  the 
partial  order,  then  substituting  for  each  variable  a  concrete  value  that  is  consistent  with  the 
abstract  values. 

2.2.3  Multiplicity 

Uncertainty  in  the  number  of  messages  sent  during  a  computation  stems  from  various  sources, 
including: 

• Non-determinism of components. Faulty components are often non-deterministic.

• Non-determinism of message arrival-order. This may cause uncertainty in the
number of outputs of a component.

• Approximation of values. Approximating a component’s inputs may cause uncertainty
in the number of its outputs.

• Approximation of “loops” of communication. When a set of components send
messages back and forth in “loops” of communication, determining whether the computation
terminates is, in general, undecidable. Thus, determining the number of messages
is also, in general, undecidable.

Uncertainty in the number of messages is handled in the abstract framework by using
multiplicities, which are approximations of numbers of messages. For example, a component
subject to send-omission failures [HT94, Section 2.3], which cause a component to possibly
omit the sending of each message normally produced, might emit each output with a
multiplicity of either zero or one.

Since multiplicities approximate numbers (of messages), we can represent them with
elements of

\[ Mul = \mathcal{P}_{fin}(SVal \times AMul) \setminus \{\emptyset\}, \tag{2.38} \]

where the set of abstract multiplicities is

\[ AMul = \{ a \in AVal \mid [\![a]\!]_{AVal} \in (Set(\mathbb{N}) \setminus \{\emptyset, \{0\}\}) \}. \tag{2.39} \]

We exclude {0} from the possible meanings of abstract multiplicities, because ms-atoms are
used to represent messages, and a ms-atom with abstract multiplicity denoting {0} would
represent no messages.

Abstract Multiplicities. Abstract multiplicities are analogous to the superscripts in regular
expressions. Recall, for example, that the regular expression a* represents sequences of
a’s of arbitrary length, and the regular expression a? represents sequences of a’s of length 0
or 1. To promote the resemblance between ms-atoms and regular expressions, we assume in
examples that AVal contains the following elements with the following meanings:

\[ [\![1]\!]_{AVal} = \{1\} \tag{2.40} \]

\[ [\![?]\!]_{AVal} = \{0, 1\} \tag{2.41} \]

\[ [\![*]\!]_{AVal} = \mathbb{N} \tag{2.42} \]

\[ [\![+]\!]_{AVal} = \mathbb{N} \setminus \{0\} \tag{2.43} \]

For example, the outputs of a component subject to send-omission failures might have abstract
multiplicity “?”.
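The four abstract multiplicities used in examples can be mimicked in Python by membership predicates on message counts. This is a sketch only; the dictionary AMUL_MEANING and the function count_allowed are names introduced here for illustration.

```python
# Meaning of each abstract multiplicity as a predicate on a message count n >= 0.
AMUL_MEANING = {
    "1": lambda n: n == 1,        # exactly one message
    "?": lambda n: n in (0, 1),   # zero or one message
    "*": lambda n: n >= 0,        # any number of messages
    "+": lambda n: n >= 1,        # at least one message
}


def count_allowed(amul, n):
    """True iff n messages are consistent with abstract multiplicity amul."""
    return AMUL_MEANING[amul](n)
```

For instance, an output subject to send-omission failures has multiplicity "?", so both 0 and 1 messages are allowed, but 2 are not.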

Symbolic Multiplicities. Symbolic multiplicities track relationships between multiplicities
of different messages. They play an important role in the analysis of systems whose
fault-tolerance involves atomicity. Atomicity properties are typically of the form: “All
non-faulty components perform a given action, or none of them do.” Such properties correlate
multiplicities of actions (e.g., message receptions) at different sites.

For example, the atomicity requirement in reliable broadcast is: for each broadcast message,
either every non-faulty process delivers the message, or none do. Since a process may
crash before or after sending a particular message, the transmission and hence delivery of
a message is not guaranteed. This would be reflected by the abstract multiplicity being ?
instead of 1. If two processes each receive a message with abstract multiplicity “?”,
we cannot determine whether the atomicity requirement is satisfied. If, however,
two processes each receive a message with multiplicity M:?, where M is a variable, then
we know that either both processes received the message (i.e., M is interpreted as one), or
neither did (i.e., M is interpreted as zero). Reliable broadcast is analyzed in detail in
Section 4.1.
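To see why a shared variable makes the difference, the following Python sketch enumerates the delivery counts that are consistent with a list of multiplicities. It is a simplification made for this example: every multiplicity is assumed to have abstract part "?", and a symbolic part of None stands for the wildcard.

```python
from itertools import product


def outcomes(multiplicities):
    """Return all tuples of delivery counts consistent with the given
    multiplicities. Each multiplicity is (symbol, "?"); symbol None is the
    wildcard. Equal variables must receive equal counts, whereas each
    wildcard position is chosen independently."""
    symbols = sorted({s for s, _ in multiplicities if s is not None})
    wild_positions = [i for i, (s, _) in enumerate(multiplicities) if s is None]
    results = set()
    for var_vals in product((0, 1), repeat=len(symbols)):
        env = dict(zip(symbols, var_vals))
        for wild_vals in product((0, 1), repeat=len(wild_positions)):
            wild = dict(zip(wild_positions, wild_vals))
            results.add(tuple(
                env[s] if s is not None else wild[i]
                for i, (s, _) in enumerate(multiplicities)
            ))
    return results
```

With two wildcard multiplicities, all four combinations are possible, including the non-atomic ones; with the shared variable M, only (0, 0) and (1, 1) remain.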
Notational Conventions. The notational conventions for values apply to multiplicities
as well. To foster the resemblance between ms-atoms and regular expressions, we sometimes
write multiplicities as superscripts. This notation is used only if the tag is zero; in other
words, we sometimes write the ms-atom $\langle mul, val, 0\rangle$ as $val^{mul}$. When writing a ms-atom
as $val^{mul}$, we usually elide the multiplicity if it is $\{\langle \_, 1\rangle\} \in Mul$. Thus, the ms-atom
$\langle\{\langle \_, 1\rangle\}, val, 0\rangle$ may be written simply as val.

These conventions are used in Figure 2.3, which represents the behavior of the two-stage
replicated pipeline when $F_1$ suffers a Byzantine failure.⁷ For example, the ms-atom on
edge $\langle F_1, S\rangle$ is, when written more explicitly, $\langle *, \top_V, 0\rangle$; since the tag is zero, this ms-atom is
displayed as $\top_V^{*}$. Note that the multiplicity in this ms-atom is, when written more explicitly,
$\{\langle \_, *\rangle\}$, representing an arbitrary number of messages. The multiplicity in the ms-atom on
edge $\langle S, F_1\rangle$ is $\{\langle \_, 1\rangle\}$, so it is elided. As another example, in Figure 2.2, the multiplicities
are all $\{\langle \_, 1\rangle\}$, and the tags are all zero, so the multiplicities and tags are all elided.


2.2.4  Tags 

Recall that tags are introduced to allow multiple ms-atoms with the same value and
multiplicity to appear on an edge. One might ask: “If the values are the same, why not just
combine those ms-atoms into a single ms-atom with a ‘larger’ multiplicity? This would
eliminate the need for tags.” That approach is indeed possible but sometimes undesirable, since
ordering information may be lost when the ms-atoms are merged. For example, consider the
poset containing the three ms-atoms

\[ \ell_1 = \langle 1, X{:}N, 0\rangle \]

\[ \ell_2 = \langle 1, Y{:}N, 0\rangle \]

\[ \ell_3 = \langle 1, X{:}N, 1\rangle, \]

with the ordering $\ell_1 \prec \ell_2 \prec \ell_3$. Note that $\ell_1$ and $\ell_3$ differ only in their tag. This poset
represents sequences of three messages containing natural numbers and such that the first
and last messages in the sequence contain the same number. If we merge the two ms-atoms
containing X into a single ms-atom, such as $\langle 2, X{:}N, 0\rangle$, we will not be able to express that
exactly one occurrence of the number represented by X appears before the occurrence of the
number represented by Y.

⁷ Figure 1.1 represents the same system in the same failure scenario. However, Figure 1.1 is based on the
perturbational framework of Chapter 3, while Figure 2.3 uses the non-perturbational framework presented
in this chapter.

2.2.5  Message  Ordering 

The  partial  ordering  in  a  history  (2.28)  approximates  the  orderings  between  the  messages 
represented  by  the  ms-atoms  in  that  history.  Since  a  history  represents  the  messages  trans¬ 
mitted  along  a  single  channel  (i.e.,  between  a  single  pair  of  components),  only  orderings 
between  messages  sent  on  the  same  channel  are  reflected  in  our  representation  (2.29)  of 
runs.  As  in  Kahn’s  model,  orderings  between  messages  on  different  channels  are  ignored. 
This simplifies the semantics considerably. The disadvantage is that the behavior of
non-strict components sensitive to inter-channel orderings cannot be specified exactly; the output
ms-atoms  of  such  a  component  must  represent  that  component’s  outputs  for  each  possible 
interleaving  of  the  inputs  from  different  sources.  Some  ideas  on  extending  the  framework  to 
include  inter-channel  orderings  are  mentioned  in  Chapter  6. 

2.3  Representation  of  Components 

By analogy with definition (2.4) of determinate processes, input-output functions have
signature⁸

\[ IOF = \{ f \in Hist \to Hist \mid tagUniform(f) \}, \tag{2.44} \]

where tagUniform(f) is a sanity condition requiring that renaming of tags in the argument of
f causes no change in the output of f except possibly renaming of tags. This requirement is
sensible because tags are artifacts of the formalism; they don’t appear in actual messages. To

⁸ The analogy is not perfect, since (2.44) uses the plain function arrow $\to$, whereas (2.4) uses a
different arrow. The reasons for this are discussed in Section 2.4.4.
formalize this requirement, we introduce the equality $=_{Hist}$ on Hist. Informally, $h_1 =_{Hist} h_2$
means that $h_1$ and $h_2$ are the same up to renaming of tags; in other words, $h_1$ can be
obtained from $h_2$ by renaming the tags (or vice versa). A “renaming of tags” is a function
in $Tag \rightarrowtail Tag$, where for any sets S and T, $S \rightarrowtail T$ is the set of injections from S to T. We
extend such a function $h \in Tag \rightarrowtail Tag$ to a function $\hat{h} \in L \to L$ by applying h to the tag
and leaving the value and multiplicity unchanged:

\[ \hat{h}(\langle mul, val, tag\rangle) = \langle mul, val, h(tag)\rangle. \]

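Now, equality on POSet(L) up to renaming of tags is:

\[
\begin{aligned}
\langle S_1, \prec_1\rangle =_{POSet(L)} \langle S_2, \prec_2\rangle \;\equiv\; (\exists h \in Tag \rightarrowtail Tag :\ & S_1 = \bigcup_{x \in S_2} \{\hat{h}(x)\} \\
\wedge\ & \prec_1 = \bigcup_{\langle x, y\rangle \in \prec_2} \{\langle \hat{h}(x), \hat{h}(y)\rangle\}),
\end{aligned}
\]

where $\hat{h}$ denotes the extension of h to ms-atoms. Equality $=_{Hist}$ is the pointwise extension
of $=_{POSet(L)}$. With the definition of $=_{Hist}$ in hand, the definition of tagUniform is easy:

\[ tagUniform(f) = (\forall in_1, in_2 \in Hist : in_1 =_{Hist} in_2 \Rightarrow f(in_1) =_{Hist} f(in_2)). \tag{2.45} \]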

System Behavior as a Fixed-Point. As in the concrete case, a system’s behavior is
characterized by a fixed-point. For $nf \in Name \to IOF$, the system’s behavior is represented
by lfp(step(nf)), if this fixed-point exists. We can always look for a fixed-point by repeated
application of step(nf) starting from $\bot_{Run}$, where

\[ \bot_{Run} = (\lambda x{:}Name.\ \bot_{Hist}) \tag{2.46} \]

\[ \bot_{Hist} = (\lambda x{:}Name.\ \langle\emptyset, \emptyset\rangle). \tag{2.47} \]

A fixed-point r has been found if $step(nf)(r) =_{Run} r$, where the equality $=_{Run}$ on Run is the
pointwise extension of the equality $=_{Hist}$ on Hist.

In  contrast  to  the  concrete  case,  this  fixed-point  might  fail  to  exist.  Section  2.5  discusses 
the  reasons  for  this  and  gives  additional  requirements  that  ensure  existence  of  a  fixed-point. 
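The search for a fixed-point by repeated application can be sketched generically in Python. The function below is not the prototype's implementation; step, bottom, and equal are parameters standing in for step(nf), the bottom run, and equality on runs up to renaming of tags, and the iteration bound max_iters is an assumption added so that the sketch terminates even when no fixed-point exists.

```python
def iterate_to_fixed_point(step, bottom, equal, max_iters=100):
    """Apply step repeatedly, starting from bottom, until step(r) equals r.
    Returns the fixed point found, or None if the bound is reached first."""
    r = bottom
    for _ in range(max_iters):
        nxt = step(r)
        if equal(nxt, r):
            return r
        r = nxt
    return None  # no fixed point found within max_iters applications
```

A toy use: with runs modeled as frozensets of integers and a step function that adds the successor of the largest element up to 3, the iteration stabilizes at {1, 2, 3}; with a step function that always increases its argument, the bound is reached and None is returned.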

2.3.1  Notation  for  Functions 

We freely use standard mathematical constructs, such as logical formulas and lambda
expressions, in definitions of functions. We also use the following constructs from the functional
programming language CAML Light [Ler97], a dialect of ML closely related to Standard ML [MTH90].
Conditionals. The conditional expression if b then $e_1$ else $e_2$ has the obvious meaning.

Comments. Comments begin with (* and end with *).

Binding. The binding construct is

\[
\begin{array}{l}
\mathbf{let}\ var = expr_1 \\
\mathbf{in}\ expr_2
\end{array}
\tag{2.48}
\]

where var is a variable and $expr_1$ and $expr_2$ are expressions. The result of evaluating (2.48)
is the result of evaluating $expr_2$ in a context in which var is bound to the result of evaluating
$expr_1$. For example, the Fibonacci function can be written

\[
\begin{array}{l}
f = (\lambda i{:}\mathbb{N}.\ \mathbf{if}\ i = 0 \vee i = 1\ \mathbf{then}\ 1 \\
\qquad \mathbf{else}\ \mathbf{let}\ v_1 = f(i - 1) \\
\qquad \mathbf{in}\ \mathbf{let}\ v_2 = f(i - 2) \\
\qquad \mathbf{in}\ v_1 + v_2)
\end{array}
\]

To save horizontal space, we sometimes (as above) do not fully indent sequences of let
expressions.

Pattern matching. The pattern-matching construct is

\[
\begin{array}{l}
\mathbf{match}\ expr_0\ \mathbf{with} \\
\mid\ patt_1 \to expr_1 \\
\mid\ patt_2 \to expr_2 \\
\quad \vdots \\
\mid\ patt_n \to expr_n
\end{array}
\tag{2.49}
\]

where each $expr_i$ is an expression and each $patt_i$ is a pattern. A pattern is composed of
data constructors (e.g., tuple or sequence constructors, or any constant) and variables. This
construct evaluates $expr_0$ and attempts to match the resulting value v against the patterns,
in order of appearance. A match occurs if v can be obtained from the pattern by some
instantiation of the variables in the pattern; as a special case, the wildcard pattern _ matches
anything.⁹ If pattern $patt_i$ yields the first match, then the variables in $patt_i$ are bound to
the values that cause $patt_i$ to equal v, and the result of evaluating (2.49) is the result of
evaluating $expr_i$ in a context augmented with these bindings. For example, the following
function accepts a number or pair of numbers, and returns the given number or the sum of
the given pair of numbers, respectively.

\[
\begin{array}{l}
f = (\lambda x{:}(\mathbb{N} \cup (\mathbb{N} \times \mathbb{N})).\ \mathbf{match}\ x\ \mathbf{with} \\
\qquad \mid\ \langle x_1, x_2\rangle \to x_1 + x_2 \\
\qquad \mid\ \_ \to x)
\end{array}
\]

⁹ Note that the wildcard symbolic value and the wildcard pattern are typeset differently, although both
are underscores.

2.3.2  Running  Example 

To illustrate the abstract model, we give input-output functions that represent the processes
given in Section 2.1.3 for the replicated pipeline example. An input-output function f
represents a process p if, whenever an input history h represents a concrete input history ch,
then the output history f(h) represents each possible concrete output history of p on input
ch. (The notion of an input-output function representing a process is formalized in Section
2.4.1.)

Definition of Src. Source S is represented by input-output function $Src(\{F_1, F_2, F_3\})$, where

\[ Src(dests) = (\lambda h{:}Hist.\ (\lambda x{:}Name.\ \mathbf{if}\ x \in dests\ \mathbf{then}\ \langle\{\langle 1, X{:}N, 0\rangle\}, \emptyset\rangle\ \mathbf{else}\ \langle\emptyset, \emptyset\rangle)), \tag{2.50} \]

with $X \in Var(S)$. We have used some of the notational conventions described in Section
2.2; for example, the multiplicity 1 abbreviates $\{\langle \_, 1\rangle\}$, and the value $X : N$ abbreviates
$\{\langle X, N\rangle\}$.

Definition of Comp. Processors $F_i$ and $G_i$ are represented by appropriate instances of

\[
\begin{array}{l}
Comp(src, dest, op) = (\lambda h{:}Hist.\ (\lambda x{:}Name.\ \mathbf{if}\ x = dest\ \mathbf{then}\ \overline{apOp}(op, N)(h(src)) \\
\qquad \mathbf{else}\ \langle\emptyset, \emptyset\rangle)),
\end{array}
\tag{2.51}
\]

where for $op \in Sym$, $aval \in AVal$, and $val \in Val$, the value $apOp(op, aval)(val) \in Val$ is
defined as follows, for each possibility $\langle s, a\rangle$ in val:

• If the abstract value a associated with s is aval, then the result contains the symbolic
value obtained by applying the operator op to s, paired with aval (think of aval as
the domain and range of op).

• Otherwise, we take the result of applying op to s to be arbitrary; specifically,
the result contains $\top_V$ (or, written out in full, $\langle \_, \top_V\rangle$), which represents all
concrete values:

\[ [\![\top_V]\!]_{AVal} = CVal. \tag{2.52} \]
Thus,

\[
apOp(op, aval)(val) = \bigcup_{\langle s, a\rangle \in val} \{ \mathbf{if}\ a = aval\ \mathbf{then}\ \langle apply(op, \langle\!\langle s\rangle\!\rangle), aval\rangle\ \mathbf{else}\ \langle \_, \top_V\rangle \}. \tag{2.53}
\]

For a symbol $op \in Sym$ and a sequence $\langle\!\langle s_1, \ldots, s_n\rangle\!\rangle \in Seq(SVal)$, $apply(op, \langle\!\langle s_1, \ldots, s_n\rangle\!\rangle)$ returns
the symbolic value $op(s_1, \ldots, s_n)$ if all the $s_i$ are in $SVal_0$; otherwise (i.e., if any $s_i$ is a
wildcard), it returns a wildcard.

$\overline{apOp}(op, aval)$ is an extension of $apOp(op, aval)$ from values to posets of ms-atoms; it
operates on the values, leaves all multiplicities unchanged, and changes tags if necessary in
order to avoid “collisions”. For example, if the poset $\langle S, \emptyset\rangle$ consists of the ms-atoms
$\langle 1, X{:}\top_V, 0\rangle$ and $\langle 1, Y{:}\top_V, 0\rangle$, then a retagging is needed to avoid a “collision”, and we
have (e.g.)

\[ \overline{apOp}(F, N)(\langle S, \emptyset\rangle) = \langle\{\langle 1, \top_V, 0\rangle, \langle 1, \top_V, 1\rangle\}, \emptyset\rangle. \]

Definition of Voter. The voter is defined in terms of two auxiliary functions: ballot,
which extracts a vote from a poset of ms-atoms, and tally, which uses ballot to extract a set
of votes from a poset of ms-atoms, then tallies those votes to determine the majority (if any).
Extracted votes and outputs of tally are both elements of $Mul \times (SVal \times AVal)$, indicating
the multiplicity of the vote and the value voted for. The definitions of ballot and tally are
given below. The function representing the voter tests whether it has received some input
from each source. If not, it just “waits”, i.e., produces no output (represented by the empty
poset $\langle\emptyset, \emptyset\rangle$); this corresponds to the then branch of the conditional. If it has received some
input from each source, it calls tally; this corresponds to the else branch of the conditional.

\[
\begin{array}{l}
Voter(srcs, dest, aval) = (\lambda h{:}Hist.\ (\lambda x{:}Name. \\
\quad \mathbf{if}\ x = dest\ \mathbf{then} \\
\qquad \mathbf{if}\ (\exists y \in srcs : \pi_1(h(y)) = \emptyset)\ \mathbf{then}\ \langle\emptyset, \emptyset\rangle \\
\qquad \mathbf{else}\ \mathbf{let}\ \langle mul, val\rangle = tally(ballot, srcs, aval, h) \\
\qquad\quad \mathbf{in}\ \langle\{\langle mul, \{val\}, 0\rangle\}, \emptyset\rangle \\
\quad \mathbf{else}\ \langle\emptyset, \emptyset\rangle)).
\end{array}
\tag{2.54}
\]

Definition of ballot. Ballot condenses the input $S \in POSet(L)$ from a component into
a single multiplicity and a single element of $SVal \times AVal$. Roughly, if S contains only a
single element of $SVal \times AVal$ (i.e., a single value with cardinality one), then ballot returns
that value and the associated multiplicity. Otherwise, ballot uses a coarse approximation: it
returns a “top” (i.e., an arbitrary multiplicity and value). It would be easy to make ballot
more precise, by having it extract all the elements of $SVal \times AVal$ in S (with the associated
multiplicities)  and  return  them  as  a  set.  This  additional  precision  is  not  needed  to  analyze  the 
running  example,  because  non-faulty  components  do  send  exactly  one  value,  and  it  doesn’t 
matter  how  coarsely  inputs  from  faulty  processes  are  approximated,  since  those  inputs  are 
arbitrary anyway. This approximation in ballot reflects our general style in writing
input-output functions for examples: we write out the “base cases” (which typically correspond
to  singleton  sets)  exactly  but  don’t  bother  to  accumulate  sets  of  possibilities,  unless  that 
is  necessary  to  make  the  analysis  sufficiently  precise.  Of  course,  whether  an  input-output 
function  is  “sufficiently  precise”  depends  on  the  histories  on  which  that  function  will  be 
evaluated  and  hence  on  the  rest  of  the  system  being  analyzed. 

The definition of ballot is, for $S \in POSet(L)$,

\[
\begin{array}{l}
ballot(S) = \mathbf{match}\ \pi_1(S)\ \mathbf{with} \\
\quad \mid\ \{\langle mul, val, tag\rangle\} \to \mathbf{match}\ val\ \mathbf{with} \\
\qquad\quad \mid\ \{s{:}a\} \to \langle mul, s{:}a\rangle \\
\qquad\quad \mid\ \_ \to \text{(* approximate *)}\ \langle mul, \top_V\rangle \\
\quad \mid\ \_ \to \text{(* approximate *)}\ \langle\{\_{:}*\}, \_{:}\top_V\rangle
\end{array}
\tag{2.55}
\]

Definition of tally. The function tally(ballot, srcs, aval, h) extracts and counts ballots
cast in history $h \in Hist$ by components named in $srcs \in Seq(Name)$, using the parameter
$ballot \in POSet(L) \to (Mul \times (SVal \times AVal))$ to extract the ballots. The remaining parameter
$aval \in AVal$ is the “type” of values expected by the voter: if all the ballots have abstract
value aval, or if there is a majority of equal ballots with abstract value aval, then the voter’s
output has abstract value aval; otherwise, the voter’s output value may be arbitrary. When
the abstract value of the voter’s output is aval, the symbolic part is determined as follows.
Generally, the output is just a (symbolic) application of the operator maj to the symbolic
values in the ballots. However, if a majority of those symbolic values are the same, then that
application can be simplified (this is a form of symbolic computation), yielding just that
majority symbolic value. Finally, since we don’t allow wildcards inside applications, if the
application can’t be simplified and one of the ballots contains a wildcard, tally just returns
a wildcard. The definition of tally is

\[
\begin{array}{l}
tally(ballot, srcs, aval, h) = \\
\quad \mathbf{let}\ bllts = \overline{(ballot \circ h)}(srcs) \\
\quad \mathbf{in}\ \mathbf{let}\ mul = \mathbf{if}\ (\forall i \in dom(bllts) : definite(\pi_1(bllts[i])))\ \mathbf{then}\ \{\langle \_, 1\rangle\}\ \mathbf{else}\ \{\langle \_, ?\rangle\} \\
\quad \mathbf{in}\ \mathbf{let}\ val = \mathbf{if}\ \text{at least } \lceil(|srcs| + 1)/2\rceil \text{ ballots are for some } \langle s, a\rangle \in SVal \times AVal\ \mathbf{then} \\
\qquad\qquad\quad \mathbf{if}\ (s = \_) \vee (a \neq aval)\ \mathbf{then}\ \langle \_, \top_V\rangle\ \mathbf{else}\ \langle s, a\rangle \\
\qquad\qquad \mathbf{else}\ \mathbf{if}\ \overline{\pi_1 \circ \pi_2}(bllts) \in Seq(SVal_0) \wedge \overline{\pi_2 \circ \pi_2}(bllts) \in Seq(\{aval\})\ \mathbf{then} \\
\qquad\qquad\quad \langle apply(maj, \overline{\pi_1 \circ \pi_2}(bllts)), aval\rangle \\
\qquad\qquad \mathbf{else}\ \langle \_, \top_V\rangle \\
\quad \mathbf{in}\ \langle mul, val\rangle
\end{array}
\tag{2.56}
\]

where the overline denotes pointwise extension (to sequences), a multiplicity is definite iff it
can’t denote zero, i.e.,

\[ definite(mul) = (\forall x \in mul : 0 \notin [\![\pi_2(x)]\!]_{AVal}), \tag{2.57} \]

and for a set S, |S| is the size of S. The constant maj has interpretation $\rho_C(maj) = maj$.¹⁰
This definition of tally incorporates some easily removable approximations; for example, it
ignores symbolic multiplicities.
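The following Python sketch mirrors the structure of tally under simplifying assumptions: the ballots are passed in directly as a non-empty list of (multiplicity, (symbolic, abstract)) pairs instead of being extracted from a history, the abstract multiplicities are restricted to the four used in examples, and "TopV" is an assumed stand-in for the top abstract value.

```python
from collections import Counter
from math import ceil

WILDCARD, TOP = "_", "TopV"  # "TopV" is an assumed name for the top abstract value
CAN_BE_ZERO = {"1": False, "?": True, "*": True, "+": False}


def definite(mul):
    """A multiplicity (set of (symbol, abstract multiplicity) pairs) is
    definite iff none of its possibilities can denote zero."""
    return all(not CAN_BE_ZERO[amul] for _sym, amul in mul)


def tally(ballots, aval):
    """ballots: non-empty list of (mul, (s, a)), one ballot per source."""
    one = frozenset({(WILDCARD, "1")})
    maybe = frozenset({(WILDCARD, "?")})
    mul = one if all(definite(m) for m, _ in ballots) else maybe
    needed = ceil((len(ballots) + 1) / 2)
    (s, a), votes = Counter(v for _, v in ballots).most_common(1)[0]
    if votes >= needed:
        # A majority of equal ballots: simplify to that ballot if it is usable.
        val = (WILDCARD, TOP) if s == WILDCARD or a != aval else (s, a)
    elif all(sym != WILDCARD and ab == aval for _, (sym, ab) in ballots):
        # No majority, but all ballots are well-typed: symbolic application of maj.
        val = (("maj", *[sym for _, (sym, _ab) in ballots]), aval)
    else:
        val = (WILDCARD, TOP)
    return mul, val
```

In the failure-free pipeline, all three ballots are G(F(X)):N with multiplicity 1, so the sketch returns that same value with multiplicity 1; if one ballot is arbitrary but the other two agree, the agreed value is still returned, with multiplicity "?".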

Definition of Act. The actuator is represented by

\[ Act = (\lambda h{:}Hist.\ (\lambda x{:}Name.\ \langle\emptyset, \emptyset\rangle)). \tag{2.58} \]

2.4  Semantics  and  Soundness 

This section relates the system model described in Sections 2.2 and 2.3 to the concrete model
presented  in  Section  2.1.  Informally,  soundness  asserts  that  a  run  obtained  from  the  fixed- 
point  analysis  represents  all  possible  concrete  runs  of  the  system  of  interest.  Soundness 
allows  conditions  on  a  concrete  system  to  be  re-cast  as  conditions  on  that  (abstract)  run:  if 
that  run  satisfies  a  certain  condition,  then  all  concrete  runs  represented  by  that  run  satisfy 
a  related  condition,  so  all  concrete  runs  of  the  system  satisfy  that  related  condition. 

For example, suppose we want to check for the replicated pipeline described in Section
2.1.3 that all of the inputs to the voter V are equal. According to the semantics below, all
concrete runs represented by a run r satisfy that condition if all input ms-atoms of the voter
V in run r together contain only a single symbolic value and that symbolic value is not the
wildcard. Thus, the system has this property provided the run r in Figure 2.2
satisfies the predicate

\[
\begin{array}{l}
b = (\lambda r{:}Run.\ \mathbf{let}\ S = \{ s \mid x \in Name \wedge \ell \in \pi_1(r(V)(x)) \wedge \langle s, a\rangle \in \pi_2(\ell) \} \\
\qquad \mathbf{in}\ |S| = 1 \wedge \_ \notin S),
\end{array}
\]

which it does.

¹⁰ As mentioned in Footnote 2, maj is overloaded; that’s why it appears on both sides of this equation.

2.4.1  Semantics 

It is convenient to allow partial interpretations in the semantics. For $S \subseteq Sym$, the set of
partial interpretations of S is

\[ interp(S) = S \rightharpoonup CVal, \tag{2.59} \]

where $S \rightharpoonup T$ is the set of partial functions from S to T. For a partial (or total) function f,
dom(f) is the domain of f. The ordering on partial interpretations is

\[ \rho_1 \sqsubseteq_{interp} \rho_2 \;\equiv\; dom(\rho_1) \subseteq dom(\rho_2) \wedge (\forall s \in dom(\rho_1) : \rho_1(s) = \rho_2(s)). \tag{2.60} \]
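If a partial interpretation is modeled as a Python dictionary from symbols to concrete values, the ordering (2.60) becomes a one-line check. The function name interp_leq is chosen here for illustration.

```python
def interp_leq(rho1, rho2):
    """True iff rho2 extends rho1: every symbol defined by rho1 is defined
    by rho2 with the same value."""
    return all(s in rho2 and rho2[s] == v for s, v in rho1.items())
```

The empty interpretation is below every interpretation, and two interpretations that disagree on a shared symbol are unordered.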

Semantics of Posets of ms-atoms. The semantics of posets of ms-atoms was described
informally just below (2.27):

    Informally, a poset $\langle S, \prec\rangle \in POSet(L)$ approximates a sequence $\sigma$ of messages
    if there exists a correspondence between elements of S and elements of $\sigma$ such
    that: (1) each ms-atom in S approximates the set of corresponding messages,
    and (2) if $\ell_1 \prec \ell_2$, then all messages corresponding to $\ell_1$ precede all messages
    corresponding to $\ell_2$.

The correspondence between the elements of $\langle S, \prec\rangle$ and the elements of $\sigma$ is embodied in
a function $g \in dom(\sigma) \twoheadrightarrow S$, where $\twoheadrightarrow$ indicates a restriction to surjective (onto)
functions.¹¹ Informally, g(i) is the ms-atom representing the i-th element of $\sigma$. We use a
predicate $compat_{POSet(L)}$ to check that the correspondence g

(1) respects the values and multiplicities of the ms-atoms; more explicitly: (1a) the
concrete value $\sigma[i]$ is represented by the value in g(i), and (1b) the number of elements
of $\sigma$ associated with each ms-atom $\ell$ is represented by the multiplicity in $\ell$.

(2) respects the ordering on the poset; more explicitly, if $\ell_1 \prec \ell_2$ and $g(i_1) = \ell_1$ and
$g(i_2) = \ell_2$, then $i_1 < i_2$.

¹¹ Thus, $S \twoheadrightarrow T$ contains the functions f in $S \to T$ that satisfy $(\forall y \in T : \exists x \in S : f(x) = y)$.

Note  that  these  two  conditions  correspond  to  the  two  conditions  in  the  informal  description. 

Condition (1) requires formalizing the notion of a concrete value $cv$ being represented
by a value $v \in Val$. This involves two conditions: one based on the symbolic part of $v$, and
one based on the abstract part of $v$. The condition based on the symbolic part is expressed
by extending a given partial interpretation $\rho \in interp(Sym)$ of symbols to work on all
non-wildcard symbolic values. The extension is done by a recursive definition on the structure
of the symbolic value, which corresponds (in a sense) to evaluation of the symbolic value.
If, at some point in this “evaluation”, an “error” occurs (e.g., the operator in an expression
does not denote a function), the evaluation aborts and returns a dummy value $\bot$ (which is
required not to be in $CVal$). For $\rho \in interp(Sym)$, the extension $\bar{\rho} \in SVal_0 \to (CVal \cup \{\bot\})$
is given by

$$\bar{\rho}(s) = \begin{cases} \rho(s) & \text{if } s \in dom(\rho) \\ \bot & \text{otherwise} \end{cases}$$

$$\bar{\rho}(s(s_1, \ldots, s_n)) = \begin{cases} \rho(s)(\langle \bar{\rho}(s_1), \ldots, \bar{\rho}(s_n) \rangle) & \text{if } s \in dom(\rho) \,\land\, \langle \bar{\rho}(s_1), \ldots, \bar{\rho}(s_n) \rangle \in dom_0(\rho(s)) \\ \bot & \text{otherwise} \end{cases}$$

where $dom_0(f)$ is the domain of $f$, if $f$ is a function, and $\emptyset$ otherwise. We use this extension
to define the predicate $compat^{\rho}_{Val}$, which checks whether a concrete value $cv$ is represented
by a value $val \in Val$:

$$compat^{\rho}_{Val}(val, cv) \;=\; (\exists \langle s, a \rangle \in val : cv \in [\![a]\!]_{AVal} \,\land\, (s = \_ \,\lor\, cv = \bar{\rho}(s))). \tag{2.61}$$
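The extension of a partial interpretation and the predicate in (2.61) can be sketched as follows. This is only an illustration under several assumptions that are not in the text: symbols are Python strings, an application is a pair `(operator, argument_tuple)`, functions are finite dictionaries keyed by argument tuples (so that their domain can be inspected), the wildcard is the string `"_"`, and `meaning` is a hypothetical table giving the set of concrete values denoted by each abstract value.

```python
BOTTOM = object()   # dummy "error" value, assumed distinct from every concrete value
WILDCARD = "_"


def dom0(f):
    """Domain of f if f is a (finite, dict-encoded) function, else the empty set."""
    return f.keys() if isinstance(f, dict) else set()


def extend(rho, sv):
    """Evaluate symbolic value sv under partial interpretation rho, or return BOTTOM."""
    if isinstance(sv, str):                      # a bare symbol
        return rho.get(sv, BOTTOM)
    op, args = sv                                # an application op(args...)
    vals = tuple(extend(rho, a) for a in args)
    if op in rho and vals in dom0(rho[op]):
        return rho[op][vals]
    return BOTTOM


def compat_val(rho, val, cv, meaning):
    """cv is represented by val: some pair (s, a) admits cv abstractly and symbolically."""
    return any(
        cv in meaning[a] and (s == WILDCARD or cv == extend(rho, s))
        for (s, a) in val
    )


# Hypothetical data for illustration.
rho = {"x": 3, "one": 1, "plus": {(3, 1): 4}}
meaning = {"Nat": set(range(10)), "Even": {0, 2, 4, 6, 8}}
```

Because `extend` returns `BOTTOM` for an unbound symbol and `BOTTOM` never equals a concrete value, a pair whose symbolic part is unbound can never witness compatibility.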

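The definition of $compat^{\rho}_{POSet(L)}$ has 3 conjuncts, corresponding to conditions (1a), (1b),
and (2) above. Given a partial interpretation $\rho \in interp(Sym)$ of symbols, poset $\langle S, \prec \rangle \in POSet(L)$
represents sequence $\sigma \in Seq(CVal)$ under correspondence $g \in dom(\sigma) \twoheadrightarrow S$ iff
$compat^{\rho}_{POSet(L)}(S, \prec, \sigma, g)$ holds, where

$$\begin{aligned}
compat^{\rho}_{POSet(L)}(S, \prec, \sigma, g) \;=\; & \land\; (\forall \ell \in S : (\forall i \in g^{inv}(\ell) : compat^{\rho}_{Val}(\pi_2(\ell), \sigma[i]))) \\
& \land\; (\forall \ell \in S : compat^{\rho}_{Val}(\pi_1(\ell), |g^{inv}(\ell)|)) \\
& \land\; (\forall \langle \ell_1, \ell_2 \rangle \in {\prec} : g^{inv}(\ell_1) \prec_{Set(\mathbb{N})} g^{inv}(\ell_2))
\end{aligned} \tag{2.62}$$

where $g^{inv}(y)$ is the pre-image of $y$ under $g$, i.e.,

$$g^{inv}(y) = \{x \in dom(g) \mid g(x) = y\} \tag{2.63}$$

and $\prec_{Set(\mathbb{N})}$ is a strict partial order on sets of natural numbers:

$$S_1 \prec_{Set(\mathbb{N})} S_2 \;=\; (\forall i_1 \in S_1 : (\forall i_2 \in S_2 : i_1 < i_2)). \tag{2.64}$$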
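The three conjuncts of (2.62) can be checked mechanically for a given correspondence. The sketch below is a deliberate simplification, not the framework's definition: an ms-atom is assumed to be a triple `(mul, val, tag)` in which `mul` is a frozenset of permitted counts and `val` is a frozenset of permitted concrete values, so that both compatibility checks reduce to set membership. The correspondence `g` is a list giving the ms-atom for each position of the sequence; all names are illustrative.

```python
def preimage(g, atom):
    """Positions of the sequence that g maps to the given ms-atom (g^inv)."""
    return {i for i, a in enumerate(g) if a == atom}


def lt_set(s1, s2):
    """Every index in s1 is strictly smaller than every index in s2."""
    return all(i1 < i2 for i1 in s1 for i2 in s2)


def compat_poset(atoms, order, sigma, g):
    """Check conditions (1a), (1b), and (2) for correspondence g between sigma and atoms."""
    if len(g) != len(sigma) or set(g) != set(atoms):
        return False                      # g must map positions onto the ms-atoms
    values_ok = all(sigma[i] in atom[1] for atom in atoms for i in preimage(g, atom))
    counts_ok = all(len(preimage(g, atom)) in atom[0] for atom in atoms)
    order_ok = all(lt_set(preimage(g, a1), preimage(g, a2)) for (a1, a2) in order)
    return values_ok and counts_ok and order_ok


# Hypothetical poset: exactly one "req", followed by one or two "ack" messages.
req = (frozenset({1}), frozenset({"req"}), 0)
ack = (frozenset({1, 2}), frozenset({"ack"}), 1)
poset_atoms = {req, ack}
poset_order = {(req, ack)}
```

Membership of a sequence in the meaning of the poset would additionally require searching over all surjective correspondences `g`; this sketch only validates one candidate.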

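The meaning of a poset $\langle S, \prec \rangle \in POSet(L)$ is the set of sequences of concrete values that
it represents:

$$[\![\langle S, \prec \rangle]\!]^{\rho}_{POSet(L)} = \{\sigma \in Seq(CVal) \mid (\exists g \in dom(\sigma) \twoheadrightarrow S : compat^{\rho}_{POSet(L)}(S, \prec, \sigma, g))\}. \tag{2.65}$$

Note that if $\bar{\rho}(s) = \bot$ (where $\bar{\rho}$ is the extension of $\rho$ to symbolic values), then the
condition $cv = \bar{\rho}(s)$ in $compat^{\rho}_{Val}$ cannot hold, so pairs
containing $s$ in a value are effectively ignored. Thus, increasing a partial interpretation
$\rho$ (with respect to $\le_{interp}$) can only increase $[\![\langle S, \prec \rangle]\!]^{\rho}_{POSet(L)}$ (with respect to $\subseteq$); in other
words, $[\![\cdot]\!]^{\rho}_{POSet(L)}$ is monotonic in $\rho$.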

Recall  that  tags  in  ms-atoms  are  just  a  technical  device  to  allow  multiple  ms-atoms  with 
the  same  value  and  multiplicity  to  appear  in  a  set.  Thus,  renaming  tags  should  have  no  effect 
on the meaning of a poset of ms-atoms. It is easy to check that $[\![\cdot]\!]_{POSet(L)}$ is independent of
tags, i.e.,

$$(\forall \rho \in interp(Sym) : (\forall S_1 \in POSet(L) : (\forall S_2 \in POSet(L) : S_1 =_{POSet(L)} S_2 \;\Rightarrow\; [\![S_1]\!]^{\rho}_{POSet(L)} = [\![S_2]\!]^{\rho}_{POSet(L)}))).$$

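Semantics of Histories. The meaning of histories is a straightforward extension of $[\![\cdot]\!]_{POSet(L)}$.
For $\rho \in interp(Sym)$,

$$[\![h]\!]^{\rho}_{Hist} = \{ch \in CHist \mid (\forall x \in Name : ch(x) \in [\![h(x)]\!]^{\rho}_{POSet(L)})\}. \tag{2.66}$$

Note that $[\![\cdot]\!]^{\rho}_{Hist}$ is monotonic in $\rho$, i.e.,

$$(\forall \rho_1, \rho_2 \in interp(Sym) : (\forall h \in Hist : \rho_1 \le_{interp} \rho_2 \;\Rightarrow\; [\![h]\!]^{\rho_1}_{Hist} \subseteq [\![h]\!]^{\rho_2}_{Hist})). \tag{2.67}$$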

Semantics of Input-Output Functions. Informally, an input-output function $f$ represents
a process $p$ if, whenever an input history $in \in Hist$ represents the concrete input
history of $p$, then $f(in)$ represents the concrete output history of $p$. The values of the local
variables $lvar$ of the component can be chosen freely to “match” the concrete values in a
sequence, while the values of other variables may already be constrained. The values chosen
for the local variables in $f(in)$ can depend on the inputs to the process. However, to
ensure soundness, this dependence must be monotonic, i.e., additional inputs can’t change
the values chosen for those local variables. In other words, as more inputs are received, the
values of more local variables are determined, and those values are unchanged when yet more
inputs are received. Formally, the dependence of the values of local variables on the concrete
inputs is captured in a monotonic and continuous function $g \in CHist \to interp(lvar)$. For
$p \in Process$, $\rho_c \in Interp(Con)$, $lvar \subseteq Var$, and $f \in IOF$,

$$\begin{aligned}
p \sqsubseteq^{\rho_c, lvar}_{IOF} f \;=\; (\forall \langle dp, ir \rangle \in p : {} & (\exists g \in CHist \to interp(lvar) : \\
& (\forall \rho_e \in Interp(Var \setminus lvar) : (\forall in \in Hist : (\forall ch \in ir : \\
& \quad ch \in [\![in]\!]^{\rho_c \cup \rho_e \cup g(ch)}_{Hist} \;\Rightarrow\; dp(ch) \in [\![f(in)]\!]^{\rho_c \cup \rho_e \cup g(ch)}_{Hist}))))).
\end{aligned} \tag{2.68}$$

Note that we use union to combine functions with disjoint domains (if functions are regarded
as sets of pairs, this does not even require overloading $\cup$). If $p \sqsubseteq^{\rho_c, lvar}_{IOF} f$, we say $f$ represents $p$.

Semantics of Systems. An abstract system comprises a mapping $nf \in Name \to IOF$
and a partial interpretation $\rho_a \in interp(Con)$ of constants. The partial interpretation
allows the user to specify, for example, that maj represents a majority function, or that
the constant encrypt represents DES encryption. The meaning of systems is essentially a
pointwise extension of the meaning of input-output functions: for $np \in Name \to Process$,
$nf \in Name \to IOF$, and $\rho_a \in interp(Con)$,

$$np \sqsubseteq_{Sys} \langle nf, \rho_a \rangle \;=\; (\exists \rho_c \in Interp(Con, \rho_a) : (\forall x \in Name : np(x) \sqsubseteq^{\rho_c, Var(x)}_{IOF} nf(x))), \tag{2.69}$$

where the extensions of a partial interpretation $\rho \in interp(S)$ are

$$Interp(S, \rho) = \{\rho_1 \in Interp(S) \mid \rho \le_{interp} \rho_1\}. \tag{2.70}$$

If $np \sqsubseteq_{Sys} \langle nf, \rho_a \rangle$, we say the abstract system $\langle nf, \rho_a \rangle$ represents the concrete system $np$.
To reduce clutter, we have left $Var$ implicit in the notation for abstract systems.
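For intuition about (2.70), the set of total extensions can be enumerated when the symbol set and the value domain are both finite. That finiteness is an assumption made only for this sketch (the concrete value domain is not finite in general), and the function name `total_extensions` is hypothetical.

```python
from itertools import product


def total_extensions(symbols, values, rho):
    """Yield each total interpretation of `symbols` over `values` that extends rho.

    Assumes dom(rho) is a subset of `symbols`.
    """
    free = [s for s in sorted(symbols) if s not in rho]
    for choice in product(values, repeat=len(free)):
        yield {**rho, **dict(zip(free, choice))}


# Hypothetical example: three constants, two possible values, one constant fixed.
example_symbols = {"a", "b", "c"}
example_extensions = list(total_extensions(example_symbols, [0, 1], {"a": 1}))
```

Each yielded dictionary is total on `symbols` and agrees with `rho` on its domain, so the existential quantification over extensions in (2.69) could be tested by iterating over this generator.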

Semantics of Runs. For $r \in Run$ and $\rho_a \in interp(Con)$,

$$[\![r]\!]^{\rho_a}_{Run} = \{cr \in CRun \mid (\exists \rho_c \in Interp(Con, \rho_a) : (\exists \rho_v \in Interp(Var) : (\forall x \in Name : cr(x) \in [\![r(x)]\!]^{\rho_c \cup \rho_v}_{Hist})))\}. \tag{2.71}$$

The existential quantification over $\rho_v$ reflects the intuition that the interpretations of the
variables in a run can be chosen freely to “match” the concrete values.
2.4.2  Soundness

Although fixed-points are not, in general, guaranteed to exist, if iteration of $step(nf)$ does
lead to a fixed-point $r$, soundness means that $r$ represents all possible finite behaviors of the
concrete system, i.e., that $cruns_{fin}(np) \subseteq [\![r]\!]^{\rho_a}_{Run}$, where

$$cruns_{fin}(np) = \{cr \in cruns(np) \mid (\forall x \in Name : (\forall y \in Name : |cr(y)(x)| < \infty))\}. \tag{2.72}$$

The  restriction  to  finite  behaviors  is  only  for  convenience;  the  framework  and  semantics  can 
be  extended  to  deal  with  infinite  behaviors  as  well. 

Soundness  is  expressed  by  the  following  theorem: 

Theorem 2.2. For all $np \in Name \to Process$, all $nf \in Name \to IOF$, all $\rho_a \in interp(Con)$,
and all $i_{fp} \in \mathbb{N}$, if $np \sqsubseteq_{Sys} \langle nf, \rho_a \rangle$, and if $r = step(nf)^{i_{fp}}(\bot_{Run})$ is a fixed-point of $step(nf)$,
then all finite runs of the system are represented by $r$, i.e., $cruns_{fin}(np) \subseteq [\![r]\!]^{\rho_a}_{Run}$.

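Proof: Let $\rho_c \in Interp(Con, \rho_a)$ witness the existential quantification in $np \sqsubseteq_{Sys} \langle nf, \rho_a \rangle$.
Consider any $cr_0 \in cruns_{fin}(np)$. By definition (2.14) of $cruns$, there exists $h \in Name \to IRProcess$
such that

$$\begin{aligned}
& \land\; (\forall x \in Name : h(x) \in np(x)) \\
& \land\; (\forall x \in Name : enabled(h(x), \langle\langle step(\pi_1 \circ h)^i(\bot_{CRun})(x) \rangle\rangle_{i \in \mathbb{N}})) \\
& \land\; cr_0 = lfp(step(\pi_1 \circ h)).
\end{aligned}$$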

Let $cr[i] = step(\pi_1 \circ h)^i(\bot_{CRun})$ and $r[i] = step(nf)^i(\bot_{Run})$. We show by induction that

$$(\forall i \in \mathbb{N} : (\forall x \in Name : cr[i](x) \in [\![r[i](x)]\!]^{\rho_c \cup \rho_v[i]}_{Hist})), \tag{2.73}$$

where $\rho_v[i] = \bigcup_{x \in Name} g(x)(cr[i](x))$, where for all $x$, $g(x) \in CHist \to interp(Var(x))$ is
a witness for the existential quantification in $np(x) \sqsubseteq^{\rho_c, Var(x)}_{IOF} nf(x)$ when the outermost
universal quantification there is instantiated with $\langle dp, ir \rangle = h(x)$.

Base Case. For $i = 0$, the claim is that $(\forall x \in Name : \bot_{CRun}(x) \in [\![\bot_{Run}(x)]\!]^{\rho_c \cup \rho_v[0]}_{Hist})$, which
follows easily from the definitions.

Step Case. Using the induction hypothesis as the antecedent of the implication in definition
(2.68) of $\sqsubseteq^{\rho_c, Var(x)}_{IOF}$, we get

$$(\forall x \in Name : \pi_1(h(x))(cr[i](x)) \in [\![nf(x)(r[i](x))]\!]^{\rho_c \cup \rho_v[i]}_{Hist}). \tag{2.74}$$

Monotonicity of all the $\pi_1(h(x))$ implies $cr[i] \le_{CRun} cr[i+1]$. Monotonicity of all the $g(x)$
then implies $\rho_v[i] \le_{interp} \rho_v[i+1]$. So, by monotonicity of $[\![\cdot]\!]^{\rho}_{Hist}$ in $\rho$ (from (2.67)), (2.74)
still holds if $\rho_v[i]$ is replaced with $\rho_v[i+1]$. From the resulting equation and definition (2.6)
of $step$, we get $(\forall x \in Name : cr[i+1](x) \in [\![r[i+1](x)]\!]^{\rho_c \cup \rho_v[i+1]}_{Hist})$. This completes the proof
of (2.73).

Finally, we show that (2.73) implies $cr_0 \in [\![r]\!]^{\rho_a}_{Run}$. Since $cr_0$ is finite, there exists $i_0 \in \mathbb{N}$
such that $(\forall i \ge i_0 : cr_0 = cr[i])$. The desired result is obtained by instantiating the universal
quantification in (2.73) with $i = \max(i_{fp}, i_0)$. ∎

2.4.3  Invariants 

In  writing  and  verifying  input-output  functions,  it  is  sometimes  convenient  to  introduce 
assumptions  about  the  values  of  variables.  These  assumptions  are  embodied  in  an  invariant 
that  has  signature 

$$Invar = \{I \in Name \to Set(interp(Var)) \mid (\forall x \in Name : I(x) \subseteq interp(Var(x)))\}. \tag{2.75}$$

For $I \in Invar$ and $x \in Name$, $I(x)$ contains the partial interpretations of $Var(x)$ that satisfy
the invariant. Note that this definition of $Invar$ excludes correlations between values of
variables local to different components; otherwise, $\sqsubseteq_{IOF}$ could not be verified independently
for different components.

To accommodate invariants in the semantics, we just restrict quantifications over interpretations
of variables to being with respect to the invariant. The changes are as follows. For
$I_l \subseteq interp(lvar)$ and $I_e \subseteq Interp(Var \setminus lvar)$, define $p \sqsubseteq^{\rho_c, lvar, I_l, I_e}_{IOF} f$ as in (2.68), but with
$interp(lvar)$ replaced with $I_l$ and $Interp(Var \setminus lvar)$ replaced with $I_e$. Extend an abstract
system to include an invariant as the third component of the tuple. For $I \in Invar$, define
$np \sqsubseteq_{Sys} \langle nf, \rho_a, I \rangle$ as in (2.69), but with $\sqsubseteq^{\rho_c, Var(x)}_{IOF}$ replaced with $\sqsubseteq^{\rho_c, Var(x), I(x), I \upharpoonright (Name \setminus \{x\})}_{IOF}$,
where for $S \subseteq Name$, the restriction of $I$ to $S$ is

$$I \upharpoonright S = \{\rho \in Interp(\textstyle\bigcup_{x \in S} Var(x)) \mid (\forall y \in S : (\lambda v : Var(y).\ \rho(v)) \in I(y))\}. \tag{2.76}$$

For $I \in Invar$, define $[\![r]\!]^{\rho_a, I}_{Run}$ as in (2.71), but with $Interp(Var)$ replaced with $I \upharpoonright Name$.
Modifying the proof of soundness to accommodate invariants is straightforward.

2.4.4  Sanity  Conditions  for  Input-Output  Functions 

As mentioned in Section 2.1, monotonicity and continuity are typically required of input-output
functions in stream-processing models, in order to eliminate from consideration input-output
functions that don’t represent any process. We discuss these two conditions in turn.

Monotonicity. Monotonicity of input-output functions is defined below (see (2.82)). Informally,
a history in $Hist$ represents a set of concrete histories. Moving up in the ordering on
$Hist$ corresponds to either extending some of those concrete histories (i.e., replacing them
with concrete histories that are larger with respect to $\le_{CHist}$) or adding “new” concrete histories
(i.e., concrete histories that aren’t extensions of old ones). The former aspect of $\le_{Hist}$
corresponds directly to the prefix orderings typical of stream-processing models. The latter
aspect  is  analogous  to  orderings  used  in  abstract  interpretation.  Typically,  it  corresponds  to 
an  increase  in  the  set  of  possible  behaviors  at  some  point  in  the  program  (e.g.,  to  an  increase 
in  the  set  of  possible  values  that  may  be  sent  by  some  send  statement).  It  is  interesting  to 
note  that  this  also  corresponds  (albeit  indirectly)  to  further  execution  of  the  system.  For 
example,  consider  analysis  of  a  pair  of  processes  that  repeatedly  reply  to  each  other.  As  the 
system  continues  to  execute,  the  processes  send  additional  messages  to  each  other,  so  the 
set  of  possible  values  in  the  messages  they  have  exchanged  (taken  collectively)  increases. 

Monotonicity  of  input-output  functions  can  be  interpreted  as  the  following  two  sanity 
conditions,  corresponding  to  the  two  aspects  of  the  ordering: 

1.  providing  additional  inputs  to  a  component  can’t  cause  the  component  to  produce 
fewer  outputs; 

2.  enlarging  the  set  of  possible  inputs  of  a  component  can’t  cause  the  set  of  possible 
outputs  of  the  component  to  shrink. 

There is no technical difficulty in augmenting the definition of $IOF$ with monotonicity
requirement (2.82), though this does require parameterizing the definition of $IOF$ by $lvar$
and $\rho_a$. We omit this requirement for two reasons. First, since we assume that the input-output
functions used in an analysis have been shown to represent the appropriate processes,
(other) sanity conditions are otiose. Second, this requirement can complicate input-output
functions  by  forcing  “uniform”  use  of  approximations.  An  input-output  function  might  fail 
to  be  monotonic  simply  because  finer  approximations  are  used  on  larger  inputs.  Assuming 
the  finer  approximation  is  needed  on  large  inputs,  a  monotonicity  requirement  would  force 
finer  approximations  to  be  used  on  smaller  inputs  as  well.  However,  assuming  the  input- 
output  function  represents  the  process  of  interest,  there  is  no  compelling  reason  to  require 
this. 
Continuity. Continuity is defined only for functions between ω-cpos. As discussed in
Section 2.5.4, Hist and Run are not ω-cpos, unless additional conditions are imposed, so in
general, it is not sensible to require continuity of input-output functions. The conditions
in Section 2.5 that ensure existence of fixed-points also trivially ensure continuity of input-output
functions.

2.5  Termination  of  Fixed-Point  Calculations 

Since  our  goal  is  automated  analysis,  of  interest  are  conditions  under  which  the  fixed-point 
can  be  computed  in  a  finite  number  of  steps.  Of  course,  when  analyzing  any  particular 
system,  one  can  seek  a  fixed-point  without  knowing  whether  one  exists,  by  iterating  step(nf) 
until  either  a  fixed-point  is  found  (i.e. ,  applying  step(nf)  again  has  no  effect,  except  possibly 
renaming  of  tags)  or  computational  resources  are  exhausted. 

It  is  more  satisfying  to  know  before  starting  the  computation  whether  this  iteration  will 
terminate  with  a  fixed-point.  To  help  state  the  relevant  conditions,  we  define  an  ascending 
chain of a partial order $\langle S, \le_S \rangle$ to be a chain $\sigma \in Chain(\langle S, \le_S \rangle)$ in which no two consecutive
elements are equal, i.e., $(\forall i \in (dom(\sigma) \setminus \{0\}) : \sigma[i-1] \ne \sigma[i])$. It follows from antisymmetry
of the partial order that all elements of an ascending chain are distinct. The basic observation
is  that  in  a  partial  order  with  no  infinite  ascending  chains,  fixed-points  can  be  computed 
in a finite number of steps. To see this, let $S$ and $f$ be as in Theorem 2.1. If $S$ has no
infinite ascending chains, then the chain $\langle\langle f^i(x) \rangle\rangle_{i \in \mathbb{N}}$ converges to the least fixed point in a
finite number of steps.

If we assume that $S$ has no infinite ascending chains, then some of the other hypotheses
of Theorem 2.1 can be weakened. In a partial order with no infinite ascending chains,
ω-chains trivially have least upper bounds (to see this, note that an ω-chain can contain
only a finite number of distinct elements, so the largest of these is the least upper bound
of the chain). Similarly, monotonic functions are trivially continuous (to see this, note
that, as in the previous remark, $lub(\sigma) = x_{max}$, where $x_{max}$ is the largest element in $\sigma$, so
$f(lub(\sigma)) = f(x_{max})$; monotonicity of $f$ implies that $f(x_{max})$ is the largest element in the
ω-chain $f(\sigma)$, so $lub(f(\sigma)) = f(x_{max}) = f(lub(\sigma))$, as desired). Thus, we have the following
corollary of Theorem 2.1.

Corollary 2.3. Let $\langle S, \le_S \rangle$ be a partial order with no infinite ascending chains. Let $f$ be
a monotonic function in $S \to S$, and let $x \in S$ be such that $x \le_S f(x)$. Then $f$ has a
fixed-point in the upper-closure of $x$ in $S$. Furthermore, the least such fixed-point is $f^i(x)$,
where $i$ is any natural number satisfying $f^i(x) = f^{i+1}(x)$.
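The iteration behind Corollary 2.3 can be sketched generically. This is an illustrative sketch rather than the prototype's implementation: the function name, the `max_steps` bound (standing in for "computational resources are exhausted"), and the `equal` parameter (which could be instantiated with equality up to renaming of tags) are all assumptions. The example instantiates $f$ with a monotonic reachability step on subsets of a small hypothetical graph.

```python
def iterate_to_fixed_point(f, x, max_steps=None, equal=lambda a, b: a == b):
    """Return (f^i(x), i) for the first i with f^i(x) equal to f^(i+1)(x).

    Raises RuntimeError if no fixed point is found within max_steps applications.
    """
    steps = 0
    while True:
        nxt = f(x)
        if equal(nxt, x):
            return x, steps
        if max_steps is not None and steps >= max_steps:
            raise RuntimeError("no fixed point found within the step bound")
        x, steps = nxt, steps + 1


# Hypothetical example: nodes reachable from node 0 in a small directed graph.
EDGES = {0: [1], 1: [2], 2: [0], 3: [4], 4: []}


def reach_step(nodes):
    """Monotonic step on frozensets: add the successors of every node already present."""
    return nodes | frozenset(t for n in nodes for t in EDGES[n])


reachable, iterations = iterate_to_fixed_point(reach_step, frozenset({0}))
```

Here the starting point satisfies $x \le f(x)$ because `reach_step` only adds nodes, and the subset order on a finite node set has no infinite ascending chains, so the loop is guaranteed to stop.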

In order to apply Corollary 2.3 to ensure termination of the fixed-point calculation for
$step(nf)$, we define a partial order $\le_{Run}$ on $Run$ and find conditions under which:

1. $step(nf)$ is monotonic;

2. $\bot_{Run} \le_{Run} step(nf)(\bot_{Run})$;

3. $\langle Run, \le_{Run} \rangle$ has no infinite ascending chains.

Since  step  is  defined  in  terms  of  input-output  functions,  we  discharge  the  first  obligation 
by  defining  orderings  on  Hist  and  showing  that  monotonicity  of  input-output  functions  with 
respect to these orderings implies monotonicity of $step$ with respect to $\le_{Run}$. To discharge
the second obligation, we show that $\bot_{Run} \le_{Run} step(nf)(\bot_{Run})$ holds given an additional
technical assumption about the input-output functions. To discharge the third obligation,
we give conditions under which $Run$ has no infinite ascending chains. Finally, we discuss
why $Run$ is not in general ω-complete.

2.5.1  Monotonicity  of  step 

The  partial  ordering  on  Run  is  defined  in  terms  of  partial  orderings  on  Hist ,  which  are,  in 
turn,  defined  in  terms  of  a  partial  ordering  on  POSet(L). 

Ordering on Posets of ms-atoms. Roughly, the partial ordering on $POSet(L)$ is: $\langle S_1, \prec_1 \rangle$
is less than $\langle S_2, \prec_2 \rangle$ if each sequence in $[\![\langle S_1, \prec_1 \rangle]\!]_{POSet(L)}$ is a prefix of a sequence in
$[\![\langle S_2, \prec_2 \rangle]\!]_{POSet(L)}$. To make this precise, we need to quantify appropriately over interpretations
of the variables. When interpreting the output ms-atoms of a component $x$, the values
of its local variables $Var(x)$ can be chosen freely, while the values of other variables may
already be constrained. So, given an interpretation $\rho_c \in Interp(Con)$ of constants, and a set
$lvar$ of local variables, we define a pre-order

$$S_1 \preceq^{\rho_c, lvar}_{POSet(L)} S_2 \;=\; (\forall \rho_v \in Interp(Var) : (\exists \rho_l \in Interp(lvar) : [\![S_1]\!]^{\rho_c \cup \rho_v}_{POSet(L)} \preceq_{Set(Seq)} [\![S_2]\!]^{\rho_c \cup (\rho_v \oplus \rho_l)}_{POSet(L)})), \tag{2.77}$$

where the pre-order $\preceq_{Set(Seq)}$ on sets of sequences is

$$S_1 \preceq_{Set(Seq)} S_2 \;=\; (\forall \sigma_1 \in S_1 : (\exists \sigma_2 \in S_2 : \sigma_1 \le_{Seq} \sigma_2)), \tag{2.78}$$

and where for functions $f$ and $g$, $(f \oplus g)$ is $f$ updated with $g$, i.e., $(f \oplus g)(x)$ is $g(x)$ for
$x \in dom(g)$, and is $f(x)$ otherwise.

It is easy to check that $\preceq^{\rho_c, lvar}_{POSet(L)}$ is, in fact, a pre-order, i.e., that it is reflexive and
transitive. It is not a partial order because it lacks antisymmetry. This lack of antisymmetry
has two causes: first, the pre-order $\preceq_{Set(Seq)}$ is not antisymmetric; second, $Run$ contains
distinct runs with the same meaning, as discussed in Section 2.5.4. We construct a partial
ordering by:

$$\begin{aligned}
S_1 \le^{\rho_c, lvar}_{POSet(L)} S_2 \;=\; & \lor\; (S_1 \preceq^{\rho_c, lvar}_{POSet(L)} S_2) \,\land\, \neg (S_2 \preceq^{\rho_c, lvar}_{POSet(L)} S_1) \\
& \lor\; S_1 =_{POSet(L)} S_2.
\end{aligned} \tag{2.79}$$

The first disjunct is a strict partial ordering; the second disjunct makes the ordering reflexive.
It is easy to check that $\le^{\rho_c, lvar}_{POSet(L)}$, like $[\![\cdot]\!]_{POSet(L)}$, is independent of tags. It is easy to check
that $\le_{POSet(L)}$ is a partial order on $POSet(L)$ quotiented by $=_{POSet(L)}$, i.e., on $POSet(L)$ with
elements related by $=_{POSet(L)}$ considered equivalent. Informally, the construction of $\le_{POSet(L)}$
ensures antisymmetry by removing orderings in $\preceq_{POSet(L)}$ between semantically equivalent
posets and between posets whose meanings are related in both directions by $\preceq_{Set(Seq)}$ (e.g.,
if one is the prefix-closure of the other).


Orderings on Histories. The local variables in a set of ms-atoms are always the variables
associated with the sender, so the ordering on histories depends on whether the histories are
regarded as input histories or output histories. For histories regarded as inputs,

$$h_1 \le^{\rho_c}_{InHist} h_2 \;=\; (\forall x \in Name : h_1(x) \le^{\rho_c, Var(x)}_{POSet(L)} h_2(x)). \tag{2.80}$$

For histories regarded as output of a component with local variables $lvar$,

$$h_1 \le^{\rho_c, lvar}_{OutHist} h_2 \;=\; (\forall x \in Name : h_1(x) \le^{\rho_c, lvar}_{POSet(L)} h_2(x)). \tag{2.81}$$


Thus, an input-output function $f$ is defined to be monotonic with respect to $lvar \subseteq Var$
(intuitively, these are the local variables of the component $f$ represents) and $\rho_a \in interp(Con)$
iff

$$(\forall \rho_c \in Interp(Con, \rho_a) : (\forall h_1 \in Hist : (\forall h_2 \in Hist : h_1 \le^{\rho_c}_{InHist} h_2 \;\Rightarrow\; f(h_1) \le^{\rho_c, lvar}_{OutHist} f(h_2)))). \tag{2.82}$$


Ordering on Runs. The ordering on runs is just a pointwise extension of the ordering on
histories regarded as inputs:

$$r_1 \le^{\rho_a}_{Run} r_2 \;=\; (\forall \rho_c \in Interp(Con, \rho_a) : (\forall y \in Name : r_1(y) \le^{\rho_c}_{InHist} r_2(y))). \tag{2.83}$$
Monotonicity of step.

Theorem 2.4. For all $nf \in Name \to IOF$ and all $\rho_a \in interp(Con)$, if for all $x \in Name$,
$nf(x)$ is monotonic with respect to $Var(x)$ and $\rho_a$, then $step(nf)$ is monotonic with respect
to $\le^{\rho_a}_{Run}$.

Proof: Assume $r_1 \le^{\rho_a}_{Run} r_2$. Using monotonicity of $nf(x)$, then expanding the definition of
$\le^{\rho_c, Var(x)}_{OutHist}$, and recognizing that $step(nf)(r)(y)(x) = nf(x)(r(x))(y)$, we have

$$(\forall \rho_c \in Interp(Con, \rho_a) : (\forall x \in Name : (\forall y \in Name : step(nf)(r_1)(y)(x) \le^{\rho_c, Var(x)}_{POSet(L)} step(nf)(r_2)(y)(x)))),$$

which is equivalent to $step(nf)(r_1) \le^{\rho_a}_{Run} step(nf)(r_2)$. ∎

Semantic  Ordering  vs.  Syntactic  Ordering.  To  obtain  the  most  general  orderings  on 
histories, we have defined the orderings directly in terms of the semantics. A more “syntactic”
ordering might be more convenient for verifying monotonicity of particular functions.
However,  such  characterizations  are  more  restrictive,  hence  not  as  widely  applicable.  For 
example,  it  seems  difficult  to  formulate  a  general  “syntactic”  ordering  with  respect  to  which 
the  input-output  functions  used  in  analysis  of  reliable  broadcast  in  Chapter  4  are  monotonic, 
because  they  wouldn’t  be  monotonic  if  max  were  replaced  with  min  in  their  definitions. 

2.5.2  The  First  Step 

To show that the first application of $step$ yields a larger run, i.e., that $\bot_{Run} \le^{\rho_a}_{Run} step(nf)(\bot_{Run})$,
we need an additional assumption about input-output functions. This is necessary because
$\bot_{Run}$ is not a least element in $\langle Run, \le^{\rho_a}_{Run} \rangle$. To see this, note that $[\![\bot_{Run}]\!]^{\rho_a}_{Run} = \{\bot_{CRun}\}$, so
any “meaningless” run (i.e., any run $r$ such that $[\![r]\!]^{\rho_a}_{Run} = \emptyset$) is less than $\bot_{Run}$.

As in the definition of the strict part of $\le_{POSet(L)}$, it is necessary here to ensure that
semantically equivalent posets of ms-atoms are not substituted for each other. When we
start the fixed-point calculation with $\bot_{CRun}$ represented by $\bot_{Run}$, the empty sequence of
messages on each edge is represented by the empty poset. If, after one step, some edges still
have the empty sequence of messages on them, then we require that those empty sequences
still be represented by the empty poset, rather than some semantically equivalent poset.
More precisely, if a process $p$ has no initial outputs to a component $y$, then we require
that $f(\bot_{Hist})(y) = \langle \emptyset, \emptyset \rangle$. Formally, we augment the definition (2.68) of $p \sqsubseteq^{\rho_c, lvar}_{IOF} f$ with the
conjunct

$$(\forall y \in Name : noInitialOut(p, y) \;\Rightarrow\; f(\bot_{Hist})(y) = \langle \emptyset, \emptyset \rangle), \tag{2.84}$$

where

$$noInitialOut(p, y) \;=\; (\forall \langle dp, ir \rangle \in p : \bot_{CHist} \in ir \;\Rightarrow\; dp(\bot_{CHist})(y) = \varepsilon). \tag{2.85}$$

This  suffices  to  establish  the  following  theorem. 

Theorem 2.5. For all $np \in Name \to Process$, all $nf \in Name \to IOF$, and all
$\rho_a \in interp(Con)$, if $np \sqsubseteq_{Sys} \langle nf, \rho_a \rangle$, then $\bot_{Run} \le^{\rho_a}_{Run} step(nf)(\bot_{Run})$.

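Proof: Expanding the definitions of $step(nf)$ and $\le_{Run}$, we need to show that for all
$\rho_c \in Interp(Con, \rho_a)$,

$$\begin{aligned}
(\forall x \in Name : (\forall y \in Name : {} & \lor\; nf(x)(\bot_{Hist})(y) =_{POSet(L)} \langle \emptyset, \emptyset \rangle \\
& \lor\; \land\; (\forall \rho_v \in Interp(Var) : (\exists \rho_l \in Interp(Var(x)) : \\
& \qquad\quad \{\varepsilon\} \preceq_{Set(Seq)} [\![nf(x)(\bot_{Hist})(y)]\!]^{\rho_c \cup (\rho_v \oplus \rho_l)}_{POSet(L)})) \\
& \phantom{\lor\;} \land\; (\exists \rho_v \in Interp(Var) : \\
& \qquad\quad [\![nf(x)(\bot_{Hist})(y)]\!]^{\rho_c \cup \rho_v}_{POSet(L)} \not\preceq_{Set(Seq)} \{\varepsilon\}))).
\end{aligned} \tag{2.86}$$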

Let $p = np(x)$. Suppose $p$ might initially send a message to $y$, i.e., there exists $\langle dp, ir \rangle \in p$ such that
$\bot_{CHist} \in ir \,\land\, dp(\bot_{CHist})(y) \ne \varepsilon$. Then by definition of $\sqsubseteq^{\rho_c, Var(x)}_{IOF}$,

$$(\exists \rho_l \in Interp(Var(x)) : (\forall \rho_e \in Interp(Var \setminus Var(x)) : dp(\bot_{CHist})(y) \in [\![nf(x)(\bot_{Hist})(y)]\!]^{\rho_c \cup \rho_e \cup \rho_l}_{POSet(L)})).$$

It is easy to check that this implies the second disjunct in (2.86). Otherwise (that is, if $p$
does not initially send a message to $y$), the antecedent in (2.84) holds, so the first disjunct
in (2.86) holds. ∎

2.5.3  Finite  Ascending  Chains 

We  give  below  a  simple  set  of  conditions  that  ensures  Run  has  no  infinite  ascending  chains. 
We  also  give  corresponding  conditions  on  input-output  functions  that  ensure  the  necessary 
closure  property,  namely,  that  application  of  step(nf)  to  a  run  satisfying  those  conditions 
yields  a  run  that  also  satisfies  those  conditions.  If  the  input-output  functions  for  a  system 
nf  satisfy  these  corresponding  conditions,  then  the  fixed-point  iteration  for  that  system  is 
guaranteed  to  terminate. 

The  conditions  on  Run ,  and  hence  the  corresponding  conditions  on  the  input-output 
functions,  are  stronger  than  necessary.  These  conditions  do  hold  for  the  input-output  func¬ 
tions  used  in  the  running  example.  For  classes  of  systems  for  which  these  conditions  are 
too  restrictive,  more  flexible  (but  probably  more  complicated)  conditions  could  be  used  to 
establish  termination  of  the  fixed-point  iteration  (provided  it  does  terminate). 
Restrictions on Runs. All ascending chains in $Run$ are finite iff all ascending chains in
$POSet(L)$ are finite, so we look at conditions for ensuring the latter. One set of sufficient
conditions is as follows. For each $n > 0$, let $FAC_n$ be the subset of $POSet(L)$ that contains
an element $\langle S, \prec_S \rangle$ iff

1. the size of $S$ is at most $n$;

2. the size of each element of $Mul$ and $Val$ occurring in (a ms-atom in) $S$ is at most $n$.
Note that the size of (say) a multiplicity is its size (cardinality) as a set.

Furthermore, we require that $AVal$ have size at most $n$. These conditions together ensure
$FAC_n$ has only finite ascending chains. Let $Run_n = Name \to (Name \to FAC_n)$.

One might think that the following weaker condition on $AVal$ is sufficient: require that
$AVal$ have no infinite ascending chains with respect to the ordering $a_1 \le_{AVal} a_2 = [\![a_1]\!]_{AVal} \subseteq [\![a_2]\!]_{AVal}$.
However, this condition is too weak; roughly, it deals only with singleton sets
of abstract values, not with larger sets of abstract values. For example, suppose
$AVal = \bigcup_{i \in \mathbb{N}} \{x_i^0, x_i^1\}$, with the meanings defined by recursion on $i$: for $a \in \{0, 1\}$, $[\![x_0^a]\!]_{AVal} = \{a\}$
and $[\![x_{i+1}^a]\!]_{AVal} = ([\![x_i^a]\!]_{AVal} \cup \{\max([\![x_i^a]\!]_{AVal}) + 2,\ \max([\![x_i^{(a+1) \bmod 2}]\!]_{AVal})\}) \setminus \{\max([\![x_i^a]\!]_{AVal})\}$.
It is easy to show that $AVal$ has no infinite ascending chains but $POSet(L)$ does.

Restrictions  on  Input-Output  Functions.  To  ensure  that 

$$r \in Run_n \;\Rightarrow\; step(nf)(r) \in Run_n,$$

it  suffices  to  require  that  each  input-output  function  satisfy: 

1.  if  the  size  of  each  input  poset  is  at  most  n,  then  the  size  of  each  output  poset  is  at 
most  n; 

2.  if  the  size  of  each  Mul  and  Val  in  the  input  is  at  most  n,  then  the  size  of  each  Mul
and  Val  in  the  output  is  at  most  n. 

2.5.4  Run is not an ω-cpo

Not all ω-chains in $Run$ have a least upper bound. To see this, consider first the ω-chain
$\langle\langle \langle \bigcup_{j < i} \{\langle 1, val, j \rangle\}, \emptyset \rangle \rangle\rangle_{i \in \mathbb{N}}$ of $POSet(L)$, where $val$ is any value and we have taken
$Tag$ to be $\mathbb{N}$. This ω-chain has no least upper bound, because $\langle \{\langle *, val, tag \rangle\}, \emptyset \rangle$ and
$\langle \{\langle ?, val, 0 \rangle, \langle *, val, 0 \rangle\}, \emptyset \rangle$ are incomparable upper bounds of it. They are incomparable
(with respect to $\le_{POSet(L)}$) because they have the same meaning but are not “syntactically”
equal (i.e., are not related by $=_{POSet(L)}$). By analogous reasoning, the ω-chain

$$\langle\langle (\lambda e : Name \times Name.\ \langle \textstyle\bigcup_{j < i} \{\langle 1, val, j \rangle\}, \emptyset \rangle) \rangle\rangle_{i \in \mathbb{N}}$$

of $Run$ has no least upper bound.

Intuitively, POSet(L) and Run are not ω-cpos because posets of ms-atoms are not canonical
representations of sets of sequences of concrete values; in particular, there are different
posets  with  the  same  meaning.  Although  simple  examples  like  the  one  above  are  easily  pro¬ 
hibited,  a  general  solution  is  complicated  by  abstract  values  whose  meanings  overlap  and  by 
algebraic  identities  among  constants. 

2.6  Sanity  Conditions  for  ms-atoms 

The  definitions  in  Section  2.2  could  be  augmented  with  numerous  sanity  conditions.  For 
example,  we  could  associate  an  arity  with  each  symbol,  and  require  that  all  interpretations 
assign  to  each  symbol  a  function  of  appropriate  arity.  We  could  also  introduce  a  type  system 
to  ensure  statically  that  functions  are  applied  only  to  values  in  their  domain.  Although  it 
is  easy  to  formulate  conditions  of  this  nature  that  ensure  that  each  ms-atom  in  isolation  is 
meaningful  (i.e.,  represents  some  set  of  messages),  it  is  difficult  to  find  equally  powerful  sanity 
conditions  on  sets  of  ms-atoms,  because  ensuring  that  a  set  of  ms-atoms  represents  some  set 
of  messages  requires  checking  satisfiability  of  the  constraints  implied  by  the  symbolic  values. 
Such  checks  would  be  feasible  only  if  the  framework  were  specialized  to  specific  abstract 
values  and  to  constants  with  specific  interpretations.  Since  these  sanity  conditions  are  not 
needed  to  ensure  soundness  of  our  analysis,  we  omit  them.  Constructing  specialized  versions 
of  the  framework  for  specific  application  areas  may  be  worthwhile,  since  it  would  facilitate  use 
of  the  framework  for  synthesis  of  fault-tolerant  systems,  by  helping  ensure  that  refinement 
does not lead to an unimplementable design.

To illustrate the difficulty of ensuring that a set of ms-atoms is meaningful, we consider
the set of ms-atoms $\{X : R_+,\ X + 1 : R_-\}$, where the constants $+$ and $1$ have their usual
interpretation. Note that the interpretation $\rho_C(+)$ of $+$ as a constant symbol is distinct
from its meaning as an abstract value. It should be clear that no set of messages
is represented by this set of ms-atoms, because there is no value for $X$ that satisfies all the
constraints. Using the semantics in Section 2.4.1 and the interpretation of $R_-$ and $R_+$ in
Section 2.2.2, this is equivalent to the claim that there is no real number $x$ such that $x > 0$
and $x + 1 < 0$. It follows, again according to the semantics in Section 2.4.1, that an
input-output function that returns such ms-atoms in its output does not represent any process.
Since soundness of the analysis is a meaningful issue only for input-output functions that do
represent processes, omitting these sanity conditions does not cause unsoundness.
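
The kind of satisfiability check discussed above can be sketched in Python for a deliberately narrow special case. The sketch is an illustration only: the names OpenInterval, R_PLUS, R_MINUS and feasible_x are hypothetical, $R_+$ and $R_-$ are assumed to be the open intervals $(0, \infty)$ and $(-\infty, 0)$, and constraints are restricted to the form "x + c lies in I" for a single real variable x, which is far less general than the symbolic values of the framework.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class OpenInterval:
    """Open interval (lo, hi) of real numbers; empty when lo >= hi."""
    lo: float
    hi: float

    def is_empty(self):
        return self.lo >= self.hi

    def intersect(self, other):
        return OpenInterval(max(self.lo, other.lo), min(self.hi, other.hi))

    def shift(self, c):
        return OpenInterval(self.lo + c, self.hi + c)


# Assumed interpretations of the abstract values R+ and R-.
R_PLUS = OpenInterval(0.0, math.inf)
R_MINUS = OpenInterval(-math.inf, 0.0)


def feasible_x(constraints):
    """Return the interval of x satisfying every constraint (c, I), read as x + c in I."""
    feasible = OpenInterval(-math.inf, math.inf)
    for offset, interval in constraints:
        # x + offset in I  is equivalent to  x in I shifted by -offset
        feasible = feasible.intersect(interval.shift(-offset))
    return feasible


# {X : R+, X + 1 : R-} asks for x > 0 and x + 1 < 0 at the same time.
contradictory = feasible_x([(0, R_PLUS), (1, R_MINUS)])
# {X : R+, X - 1 : R-} asks for 0 < x < 1, which is satisfiable.
consistent = feasible_x([(0, R_PLUS), (-1, R_MINUS)])
```

Even this toy check depends on fixing the interpretations of the abstract values and constants, which is the specialization the text says a general sanity condition would require.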


Chapter  3 


Analyzing  Systems  that  Fail 


This  chapter  describes  two  methods  for  doing  fault-tolerance  analysis.  The  first,  presented 
in  Section  3.1,  extends  the  framework  introduced  in  Chapter  2  to  specify  how  a  component 
behaves  when  it  fails.  Some  of  the  limitations  of  this  approach  are  discussed  in  Section  3.2. 

The remainder of the chapter then describes an extension to this framework that overcomes
these limitations. We increase the expressiveness of specifications in order to capture
non-trivial relationships between values in the failure-free and faulty executions. The effects
of failures are represented explicitly as changes (or perturbations) to failure-free behavior.
Section 3.3 extends the concrete model to capture correlations between a component’s
failure-free and faulty behaviors. Sections 3.4 and 3.5 then extend the representations of runs and
components, respectively, in the abstract system model.

3.1 Fault-Tolerance Analysis Without Perturbations

Any  form  of  fault-tolerance  analysis  will  require  descriptions  of  possible  component  failures 
and  fault-tolerance  requirements.  For  our  method,  these  are  dealt  with  in  Sections  3.1.1 
and  3.1.2,  respectively.  Section  3.1.3  illustrates  these  definitions  using  the  running  example 
introduced  in  Chapter  2. 

3.1.1  Behavior  of  Failure-Prone  Systems 

A component’s behavior depends on what failures it can suffer. This dependency can be
reflected by parameterizing each component by its possible failures. Let $\mathit{Fail}$ be the set of
all failures of interest, with distinguished element $\mathit{OK} \in \mathit{Fail}$ representing absence of failure.
A process is represented by a function $p$ whose domain is the set of possible failures of this
process and such that, for each $\mathit{fail} \in \mathrm{dom}(p)$, $p(\mathit{fail})$ describes the component’s behavior
when that failure is present. This parameterization is used at the concrete and abstract
levels; thus, we have

\[ \mathit{Process}_F = \{\, p \in \mathit{Fail} \rightharpoonup \mathit{Process} \mid \mathit{OK} \in \mathrm{dom}(p) \,\} \tag{3.1} \]

\[ \mathit{IOF}_F = \{\, f \in \mathit{Fail} \rightharpoonup \mathit{IOF} \mid \mathit{OK} \in \mathrm{dom}(f) \,\} \tag{3.2} \]

For example, we use $\mathit{crash} \in \mathit{Fail}$ to indicate that a component crashes at some unspecified
time during a computation. Consider a process $p \in \mathit{Process}_F$ subject only to crash failures,
i.e., $\mathrm{dom}(p) = \{\mathit{OK}, \mathit{crash}\}$. Suppose the output from $p$ to $q$ on a certain input history $h_{in}$
is $F(X)^1$, i.e., $p(\mathit{OK})(h_{in})(q)$ is the singleton poset containing the ms-atom $F(X)^1$. Since
$p$ might crash before or after sending this message, $p(\mathit{crash})(h_{in})(q)$ must represent both of
these possibilities; for example, $p(\mathit{crash})(h_{in})(q)$ might be the singleton poset containing the
ms-atom $F(X)^?$. Note that in our terminology, a failure (that is, an element of $\mathit{Fail}$, such
as $\mathit{crash}$) is not itself an event that occurs during a computation; rather, a failure is simply
a token indicating what (if any) erroneous behavior occurs during a computation.

Recall that a failure scenario associates a failure with each component of a system. This
is true at the concrete and abstract levels. At both levels, a system is represented by a
function with signature $\mathit{Name} \to (\mathit{Fail} \rightharpoonup S)$ for some $S$ ($S$ is either $\mathit{Process}$ or $\mathit{IOF}$). For
any function $f$ with such a signature, the set of failure scenarios for $f$ is

\[ FS(f) = \{\, fs \in \mathit{Name} \to \mathit{Fail} \mid (\forall x \in \mathit{Name} : fs(x) \in \mathrm{dom}(f(x))) \,\}. \tag{3.3} \]

It is convenient to define

\[ fs_{OK} = (\lambda x : \mathit{Name}.\ \mathit{OK}). \tag{3.4} \]

We say that a component $x$ is non-faulty in failure scenario $fs$ iff $fs(x) = \mathit{OK}$.
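
As a rough illustration of definitions (3.3) and (3.4), the following Python sketch enumerates the failure scenarios of a small finite system. It rests on assumptions that are not part of the formal model: a system is a finite dict from component names to dicts keyed by failures, behaviors are placeholder strings, and the names failure_scenarios, fs_ok, non_faulty and toy are hypothetical.

```python
from itertools import product

OK = "OK"  # distinguished element of Fail denoting absence of failure


def failure_scenarios(system):
    """Yield every scenario fs with fs[x] in dom(system[x]), in the spirit of (3.3).

    `system` maps each component name to a dict from failures to behaviors;
    the keys of that inner dict play the role of dom(f(x)).
    """
    names = sorted(system)
    for name in names:
        if OK not in system[name]:
            raise ValueError(f"component {name!r} has no OK behavior")
    domains = [sorted(system[name]) for name in names]
    for choice in product(*domains):
        yield dict(zip(names, choice))


def fs_ok(system):
    """The failure-free scenario of (3.4): every component is assigned OK."""
    return {name: OK for name in system}


def non_faulty(fs, name):
    """A component is non-faulty in scenario fs iff fs maps it to OK."""
    return fs[name] == OK


# Toy system: the source cannot fail; each processing component may suffer valFail.
toy = {
    "S": {OK: "source"},
    "F1": {OK: "compute", "valFail": "arbitrary"},
    "F2": {OK: "compute", "valFail": "arbitrary"},
}
scenarios = list(failure_scenarios(toy))
```

The number of scenarios is the product of the sizes of the components' failure domains, so exhaustive enumeration is practical only for small systems or restricted failure assumptions.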

The concrete runs of a concrete system $np \in \mathit{Name} \to \mathit{Process}_F$ for a given failure
scenario $fs \in FS(np)$ are given by

\[ \mathit{cruns}_F(np)(fs) = \mathit{cruns}(\lambda x : \mathit{Name}.\ np(x)(fs(x))). \tag{3.5} \]

The behavior of an abstract system $nf \in \mathit{Name} \to \mathit{IOF}_F$ for failure scenario $fs \in FS(nf)$ is
represented by $\mathit{run}_F(nf)(fs) = \mathit{lfp}(\mathit{step}_F(nf, fs))$, if it exists, where

\[ \mathit{step}_F(nf, fs) = \mathit{step}(\lambda x : \mathit{Name}.\ nf(x)(fs(x))). \tag{3.6} \]
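
The two ingredients of (3.5) and (3.6), selecting each component's behavior according to the failure scenario and then taking a fixed point, can be sketched in Python as follows. This is a sketch under stated assumptions, not the analysis itself: the names instantiate, lfp_by_iteration and toy_step are hypothetical, systems are finite dicts, and the fixed point is sought by plain iteration from a bottom element with an iteration bound. Plain iteration is an assumption made here for illustration; it reaches a least fixed point only for suitably well-behaved step functions, and, since Run is not an $\omega$-cpo, the fixed point need not exist, which the sketch reports as None.

```python
def instantiate(nf, fs):
    """Build the system (lambda x. nf(x)(fs(x))) used in (3.5) and (3.6), as a dict."""
    return {x: nf[x][fs[x]] for x in nf}


def lfp_by_iteration(step, bottom, max_iters=1000):
    """Iterate `step` from `bottom` until the value stops changing.

    Returns the fixed point reached, or None if none is reached within
    max_iters applications of step.
    """
    current = bottom
    for _ in range(max_iters):
        nxt = step(current)
        if nxt == current:
            return current
        current = nxt
    return None


def toy_step(seen):
    """Hypothetical monotone step on sets of integers (a stand-in for step on runs)."""
    return seen | {0} | {n + 1 for n in seen if n < 3}


fixed = lfp_by_iteration(toy_step, frozenset())
diverging = lfp_by_iteration(lambda n: n + 1, 0, max_iters=10)
chosen = instantiate({"F1": {"OK": "compute", "valFail": "arbitrary"}}, {"F1": "valFail"})
```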

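The meaning of $\mathit{IOF}_F$ is given by a slight extension of the definition (2.68) of $\sqsubseteq_{\mathit{IOF}}$. For
$p \in \mathit{Process}_F$ and $f \in \mathit{IOF}_F$,

\[ p \sqsubseteq_{\mathit{IOF}_F} f \;\equiv\; \mathrm{dom}(p) = \mathrm{dom}(f) \land (\forall \mathit{fail} \in \mathrm{dom}(p) : p(\mathit{fail}) \sqsubseteq_{\mathit{IOF}} f(\mathit{fail})). \tag{3.7} \]

3.1.2 Fault-Tolerance Requirements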

A fault-tolerance requirement imposes conditions on the system’s possible behaviors in certain
failure scenarios. Since a system’s possible behaviors are approximated as a run, a
fault-tolerance requirement is expressed as a mapping from failure scenarios to predicates on
runs. We use a mapping, rather than just a single predicate on runs, so that requirements
involving graceful degradation, in which stronger requirements are associated with failure
scenarios involving fewer or less catastrophic failures, can be expressed. A predicate on runs
is a function with signature $\mathit{Run} \to B$, where the booleans are

\[ B = \{\mathit{true}, \mathit{false}\}. \tag{3.8} \]

A system $nf \in \mathit{Name} \to \mathit{IOF}_F$ satisfies a fault-tolerance requirement $b \in FS(nf) \to (\mathit{Run} \to B)$
iff for each $fs \in FS(nf)$, $\mathit{run}_F(nf)(fs)$ exists and satisfies $b(fs)$. The only sanity requirement
on $b$ is that it be independent of tags, i.e.,

\[ (\forall fs \in \mathrm{dom}(b) : (\forall r_1, r_2 \in \mathit{Run} : r_1 =_{\mathit{Run}} r_2 \Rightarrow b(fs)(r_1) = b(fs)(r_2))). \tag{3.9} \]

Verifying that a concrete system satisfies its fault-tolerance requirement requires checking
that the processes are represented by the input-output functions of an abstract system $nf$,
and for each failure scenario $fs \in FS(nf)$, checking that $\mathit{run}_F(nf)(fs)$ exists and satisfies
$b(fs)$.
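
The satisfaction check just described, including a requirement with graceful degradation, might be sketched in Python as follows. Everything here is illustrative: satisfies, toy_run_of, graceful and strict are hypothetical names, runs are reduced to a dict holding only the actuator's input, and toy_run_of merely imitates a majority voter over three replicas rather than computing a real fixed point.

```python
from itertools import product

OK = "OK"


def satisfies(scenarios, run_of, requirement):
    """Check that for every scenario the run exists and satisfies requirement(fs).

    run_of(fs) stands in for run_F(nf)(fs) and returns None when no run exists.
    requirement(fs) returns a predicate on runs.
    """
    for fs in scenarios:
        run = run_of(fs)
        if run is None or not requirement(fs)(run):
            return False
    return True


def count_faulty(fs):
    return sum(1 for failure in fs.values() if failure != OK)


def toy_run_of(fs):
    """Hypothetical run: a majority voter masks the failure of at most one replica."""
    healthy = sum(1 for failure in fs.values() if failure == OK)
    return {"A": "G(F(X))" if healthy >= 2 else "unknown"}


def graceful(fs):
    """Stronger predicates for scenarios with fewer failures (graceful degradation)."""
    if count_faulty(fs) <= 1:
        return lambda run: run["A"] == "G(F(X))"  # actuator input unchanged
    return lambda run: "A" in run                 # only: some input is delivered


def strict(fs):
    """Demands an unchanged actuator input in every scenario."""
    return lambda run: run["A"] == "G(F(X))"


replicas = ["F1", "F2", "F3"]
all_scenarios = [dict(zip(replicas, choice))
                 for choice in product([OK, "valFail"], repeat=len(replicas))]
meets_graceful = satisfies(all_scenarios, toy_run_of, graceful)
meets_strict = satisfies(all_scenarios, toy_run_of, strict)
```

Using a mapping from scenarios to predicates is what lets graceful accept the toy system while strict, a single predicate applied uniformly, rejects it.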


3.1.3  Running  Example 

We develop concrete and abstract models of a two-stage replicated pipeline in which the
processing components $F_i$ and $G_i$ produce arbitrary values when faulty. These models build
on the definitions for the failure-free two-stage replicated pipeline in Sections 2.1.3 and 2.3.2.

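Concrete System. Process $\mathit{CComp}_F(\mathit{src}, \mathit{dest}, \phi)$ describes a processing component that
normally behaves the same as $\mathit{CComp}(\mathit{src}, \mathit{dest}, \phi)$, defined in (2.18), and that produces
arbitrary values when faulty.

\[ \mathit{CComp}_F(\mathit{src}, \mathit{dest}, \phi) = (\lambda \mathit{fail} : \{\mathit{OK}, \mathit{valFail}\}.\ \text{if } \mathit{fail} = \mathit{OK} \text{ then } \mathit{CComp}(\mathit{src}, \mathit{dest}, \phi) \text{ else } \mathit{CValFail}(\mathit{src}, \mathit{dest}, \mathit{CVal})), \tag{3.10} \]

where the process $\mathit{CValFail}(\mathit{src}, \mathit{dest}, S)$ sends an arbitrary value in $S \subseteq \mathit{CVal}$ to $\mathit{dest} \in \mathit{Name}$
whenever it receives a value from $\mathit{src} \in \mathit{Name}$:

\[ \mathit{CValFail}(\mathit{src}, \mathit{dest}, S) = \bigcup_{g \in \{\mathit{dest}\} \to \mathit{Seq}(S)} \{ \langle \mathit{CValFail}_g(\mathit{src}, \mathit{dest}), \mathit{CHist} \rangle \} \tag{3.11} \]

\[ \mathit{CValFail}_g(\mathit{src}, \mathit{dest}) = (\lambda h : \mathit{CHist}.\ (\lambda x : \mathit{Name}.\ \text{if } x = \mathit{dest} \text{ then } g(x)[0..(|h(\mathit{src})| - 1)] \text{ else } \varepsilon)), \tag{3.12} \]

where for any sequence $\sigma$ and any natural numbers $i$ and $j$, $\sigma[i..j]$ is the contiguous
subsequence of $\sigma$ from position $i$ to position $j$ (inclusive); as a special case, we define $\sigma[0..(-1)] = \varepsilon$.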
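
Definition (3.12) and the slicing convention $\sigma[i..j]$ can be illustrated with a short Python sketch. It is an approximation under stated assumptions: subseq and cvalfail_g are hypothetical names, histories are dicts from sender names to tuples, and $g$ is given as a finite tuple assumed to be at least as long as the input received from src, whereas the definition draws $g$ from $\mathit{Seq}(S)$.

```python
def subseq(sigma, i, j):
    """sigma[i..j]: positions i through j inclusive; subseq(sigma, 0, -1) is empty."""
    return tuple(sigma[i:j + 1])


def cvalfail_g(g, src, dest):
    """Sketch of CValFail_g(src, dest): one arbitrary value to dest per input from src.

    g maps dest to the sequence of arbitrary values the faulty process will send.
    """
    def process(h):
        def output(x):
            if x == dest:
                received = len(h.get(src, ()))
                return subseq(g[x], 0, received - 1)
            return ()  # the empty sequence: nothing is sent to other components
        return output
    return process


# A faulty F1 that answers each input from S with the next value of g.
faulty_f1 = cvalfail_g({"G1": (7, 8, 9)}, src="S", dest="G1")
after_two_inputs = faulty_f1({"S": ("a", "b")})
after_no_input = faulty_f1({})
```

The number of values sent depends only on how many inputs arrived from src, not on their contents, which is what makes the faulty output arbitrary with respect to the input.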

At this point, we won’t consider failures of the source, voter, or actuator, so these
components correspond to the same processes as before, but with a trivial lambda abstraction
wrapped around; e.g., for the source,

\[ \mathit{CSrc}_F(\mathit{dests}) = (\lambda \mathit{fail} : \{\mathit{OK}\}.\ \mathit{CSrc}(\mathit{dests})). \tag{3.13} \]

Concrete Runs. The concrete runs of this system can be computed using definition
(3.5) of $\mathit{cruns}_F$. Let $np_F$ be the obvious mapping from $\mathit{Name}$ to $\mathit{Process}_F$: $np_F(S) =
\mathit{CSrc}_F(\{F_1, F_2, F_3\})$, etc. As an example, consider the failure scenario $fs_1$ in which only $F_1$
fails. It is easy to check that

\[ \mathit{cruns}_F(np_F)(fs_1) = \bigcup_{i \in N,\ cv \in \mathit{CVal}} \{ cr_F(i, cv) \}, \]

where $cr_F(i, cv) \in \mathit{CRun}$ is the same as $cr_{re}(i)$ except that the sequence of concrete values
from $F_1$ to $G_1$ is replaced with $\langle\langle cv \rangle\rangle$, and the sequence from $G_1$ to $V$ is replaced with
$\langle\langle \phi_2(cv) \rangle\rangle$.

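Abstract System. The input-output functions representing the processors $F_i$ and $G_i$ are
appropriate instances of

\[ \mathit{Comp}_F(\mathit{src}, \mathit{dest}, \mathit{op}) = (\lambda \mathit{fail} : \{\mathit{OK}, \mathit{valFail}\}.\ \text{if } \mathit{fail} = \mathit{OK} \text{ then } \mathit{Comp}(\mathit{src}, \mathit{dest}, \mathit{op}) \text{ else } \mathit{ValFail}(\mathit{src}, \mathit{dest}, \top_V)), \tag{3.14} \]

where $\top_V$ is defined by (2.52), and $\mathit{ValFail}(\mathit{src}, \mathit{dest}, \mathit{aval})$ sends an arbitrary value
represented by $\mathit{aval} \in \mathit{AVal}$ to $\mathit{dest} \in \mathit{Name}$ when it receives an input from $\mathit{src} \in \mathit{Name}$. The
definition of $\mathit{ValFail}$ is similar in structure to definition (2.51) of $\mathit{Comp}$:

\[ \mathit{ValFail}(\mathit{src}, \mathit{dest}, \mathit{aval}) = (\lambda h : \mathit{Hist}.\ (\lambda x : \mathit{Name}.\ \text{if } x = \mathit{dest} \text{ then } \overline{\mathit{arbval}(\mathit{aval})}(h(\mathit{src})) \text{ else } \langle \emptyset, \emptyset \rangle)), \tag{3.15} \]

where $\mathit{arbval}(\mathit{aval})(\mathit{val}) = \{\langle \_, \mathit{aval} \rangle\}$, and $\overline{\mathit{arbval}(\mathit{aval})}$ is the pointwise extension of $\mathit{arbval}(\mathit{aval})$
from values to posets of ms-atoms; as in the extension of $\mathit{apOp}$ discussed below (2.53),
re-tagging may be needed to avoid collisions.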

Again, since we won’t consider for now failure of the source, voter, or actuator, the
input-output functions for those components are the same as in Section 2.3.2, but with a trivial
lambda abstraction wrapped around; e.g., for the source,

\[ \mathit{Src}_F(\mathit{dests}) = (\lambda \mathit{fail} : \{\mathit{OK}\}.\ \mathit{Src}(\mathit{dests})). \tag{3.16} \]

Fault-Tolerance Requirement. Suppose the fault-tolerance requirement for this system
is: if at most one component $F_i$ or $G_i$ fails, then the input to the actuator remains the same
as in the failure-free case. Thus, the fault-tolerance requirement is

\[ b_{re} = (\lambda fs : FS(nf_F).\ (\lambda r : \mathit{Run}.\ |\{x \in \mathit{Name} \mid fs(x) \neq \mathit{OK}\}| \leq 1 \Rightarrow b_0(fs, r))) \tag{3.17} \]

where $nf_F$ is the obvious mapping in $\mathit{Name} \to \mathit{IOF}_F$ for this example, and $b_0$ is a predicate
that expresses that the inputs to the actuator are unchanged. Finding a suitable predicate
$b_0$ is slightly tricky. A natural attempt is: $b_0(fs, r)$ holds iff the abstract inputs to the
actuator are the same as in the failure-free run, i.e., iff $r(A) =_{\mathit{Run}} \mathit{run}_F(nf)(fs_{OK})(A)$. Note
that $\mathit{run}_F(nf)(fs_{OK})(A)$ is the run in Figure 2.2. Thus, more specifically, $b_0(fs, r)$ says that
the sole input ms-atom to the actuator is from $V$ and has multiplicity of one and value
$G(F(X)) : N$.

This specification is both unnecessarily restrictive and too weak. These problems both
result from the tacit assumption that the variable $X$ represents the output of the source
in faulty runs. It is unnecessarily restrictive because in general, the input-output function
for the source could, as a result of inputs received from a faulty component, use a different
variable to represent the source’s output, even if the output itself is not really affected (at
the concrete level). For example, in Figure 1.1, faulty component $F_1$ sends new inputs to
source $S$; those new inputs could cause the input-output function for $S$ to use in its output a
different variable than it would have used otherwise. The specification is too weak because,
if the source does receive inputs from a faulty component, those inputs may cause the value
represented by $X$ to change, in which case this specification does not ensure that the input
value of the actuator is unchanged.

To remedy the weakness of this specification, we take $b_0(fs, r)$ to be the conjunction of
two conditions:

1. The poset $\mathit{run}_F(nf)(fs_{OK})(A)$ uniquely determines the inputs to the actuator as a
function of the interpretation of the variables that appear in it. More precisely, we
require that history $\mathit{run}_F(nf)(fs_{OK})(A)$ is unambiguous. A history is unambiguous if
each poset of ms-atoms in it is unambiguous. A poset of ms-atoms is unambiguous if
the ordering is total and each value and multiplicity in each ms-atom is unambiguous.
An element of $\mathit{Val}$ (including elements of $\mathit{Mul}$) is unambiguous if it is a singleton set
$\{\langle s, a \rangle\}$ and either $s \in \mathit{SVal}_0$ or $|[\![a]\!]_{\mathit{AVal}}| = 1$.

2. The concrete values represented by variables that appear in $\mathit{run}_F(nf)(fs_{OK})(A)$ are
unaffected by failures. More precisely, we require that in each failure scenario $fs$ of
interest, the input history of the actuator $A$ is structurally-unaffected.
Structurally-unaffected is the least predicate satisfying the following recursive definition. The input
history of component $x$ of system $nf$ in failure scenario $fs$ is structurally-unaffected
if $\mathit{run}_F(nf)(fs)(x) =_{\mathit{Run}} \mathit{run}_F(nf)(fs_{OK})(x)$ and, for each variable $Y$ that appears in
$\mathit{run}_F(nf)(fs)(x)$, $Y$ is local to a non-faulty component whose inputs are unambiguous
and structurally-unaffected.

We illustrate these two concepts with some examples. First, we illustrate “unambiguous”.
Let $\rho$ be an interpretation of variables, i.e., $\rho \in \mathit{Interp}(\mathit{Var})$. The value $\{X : N,\ Y : N\}$
is ambiguous because it can represent either $\rho(X)$ or $\rho(Y)$. On the other hand, the value
$X : N$ (note that we have elided the curly braces around a singleton value, as per the
notational conventions on page 23) is unambiguous, because for a given interpretation $\rho$, it
can represent only one fixed concrete value, namely, $\rho(X)$. The poset $\langle \{\ell_1, \ell_2\}, \emptyset \rangle$, where
$\ell_1 = X : N$ and $\ell_2 = Y : N$ (note that we have elided the multiplicity (of one) and the tag (of
zero), as per the notational conventions on page 26), is ambiguous, because it can represent
either $\langle\langle \rho(X), \rho(Y) \rangle\rangle$ or $\langle\langle \rho(Y), \rho(X) \rangle\rangle$. On the other hand, the poset $\langle \{\ell_1, \ell_2\}, \{\langle \ell_1, \ell_2 \rangle\} \rangle$ is
unambiguous, because it can represent only $\langle\langle \rho(X), \rho(Y) \rangle\rangle$.
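
The structural part of the “unambiguous” test can be sketched in Python. The sketch is a simplification under explicit assumptions: Atom, is_total and is_unambiguous are hypothetical names, values and multiplicities are modeled as frozensets of opaque strings, and only two of the conditions are checked (singleton values and multiplicities, and totality of the ordering); the further condition on the symbolic and abstract parts of each singleton is omitted.

```python
from collections import namedtuple
from itertools import combinations

# An ms-atom with a value, a multiplicity and a tag; value and mul are sets of alternatives.
Atom = namedtuple("Atom", ["value", "mul", "tag"])


def transitive_closure(pairs):
    """Smallest transitive relation containing `pairs`."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def is_total(atoms, order):
    """True iff every two distinct atoms are related by the closure of `order`."""
    closure = transitive_closure(order)
    return all((a, b) in closure or (b, a) in closure
               for a, b in combinations(list(atoms), 2))


def is_unambiguous(poset):
    """Simplified test: singleton values and multiplicities, and a total ordering."""
    atoms, order = poset
    singletons = all(len(a.value) == 1 and len(a.mul) == 1 for a in atoms)
    return singletons and is_total(atoms, order)


one = frozenset({"1"})
l1 = Atom(value=frozenset({"X:N"}), mul=one, tag=0)
l2 = Atom(value=frozenset({"Y:N"}), mul=one, tag=0)
either = Atom(value=frozenset({"X:N", "Y:N"}), mul=one, tag=0)

unordered = ({l1, l2}, set())      # either arrival order is possible
ordered = ({l1, l2}, {(l1, l2)})   # l1 is received before l2
```

Under these assumptions the unordered poset and the atom with two alternative values are rejected, while the ordered poset is accepted, matching the three examples in the text.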

We illustrate “structurally-unaffected” using Figure 2.3 (page 26), which shows the
behavior of a two-stage replicated pipeline when component $F_1$ suffers a Byzantine failure.
The input history of source $S$ is not structurally-unaffected, because it is not the same as
the input history of the source in failure scenario $fs_{OK}$ (shown in Figure 2.2). Informally,
this reflects the fact that the inputs of $S$ are potentially affected by the failure of $F_1$. Since
variable $X$ is local to $S$, the value represented by $X$ may depend on the inputs of $S$. Thus,
the value of $X$ is potentially affected by the failure of $F_1$. Since $X$ appears in the input
history of component $G_2$, the input to $G_2$ is also potentially affected by this failure. This
explains informally why the input history of $G_2$ in Figure 2.3 is not structurally-unaffected.

The introduction of explicit perturbations in Section 3.4 is motivated in part by the
awkwardness of this specification and the concomitant difficulty of being sure that it has the
intended meaning. In the perturbational framework, the requirement that inputs to the
actuator are unchanged can be expressed concisely as the property that the perturbations
associated with those inputs equal the “identity” perturbation, which denotes unchangedness
(details are in Section 3.5.1).

The reader might wonder whether a history containing unambiguous posets should be
considered unambiguous, since in a sense, such a history does not uniquely determine the inputs
to a component as a function of the interpretation of the variables in the history: specifically,
the history says nothing about the ordering between inputs from different senders. This is
true, but it is not cause for concern, because in our abstract system models, orderings between
inputs from different senders are always ignored. For example, suppose the fault-tolerance
requirement for some system is that the inputs to a certain component are unaffected by
failures. Consider the input histories of that component in the runs computed for failure
scenario $fs_{OK}$ and for some other failure scenario. Suppose these input histories are equal,
unambiguous, and structurally-unaffected. These input histories may still represent multiple
interleavings of concrete inputs from different senders. Since this “ambiguity” exists
independently of whether component failures are being considered, it has no impact on
fault-tolerance analysis and therefore can be ignored in our definition of “unambiguous”. In terms
of our example, since each interleaving of concrete inputs represented by the input histories
is a possible input to the component in the absence of failures, each of these interleavings
must also be an acceptable input to the component in the presence of failures.

With the input-output functions and fault-tolerance requirement just given, we can check
whether the system satisfies its fault-tolerance requirement by computing, for each failure
scenario $fs \in FS(nf_F)$, the fixed-point $r = \mathit{lfp}(\mathit{step}_F(nf_F, fs))$ and checking whether $b_{re}(fs)(r)$
is true.¹ The outcome in this case is affirmative. Of course, if it is not already established
that the input-output functions represent the appropriate processes, this must be checked
as well. Neither of these two obligations involves verifying properties of the potentially
complicated functions computed by the two stages of the pipeline; for example, verifying
that $\mathit{CComp}_F$ is represented by $\mathit{Comp}_F$ requires little more than checking that $\mathit{CComp}_F$ is
deterministic and stateless. Thus, the abstractions introduced so far do allow separation of
concerns in this example.

¹ An obvious optimization, based on definition (3.17) of $b_{re}$, is to compute the fixed-point only for failure
scenarios in which at most one component is faulty.
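
The optimization of restricting attention to failure scenarios with at most one faulty component can be sketched in Python. The function name scenarios_with_at_most and the dict pipeline_domains are hypothetical; the dict lists, for each component of the running example, the failures assumed possible for it (only the processing components can suffer valFail).

```python
from itertools import combinations, product

OK = "OK"


def scenarios_with_at_most(domains, max_faulty):
    """Yield each failure scenario in which at most max_faulty components are faulty.

    `domains` maps every component name to the set of failures it may suffer,
    which must include OK.
    """
    names = sorted(domains)
    for k in range(max_faulty + 1):
        for faulty in combinations(names, k):
            # Non-OK failures available to each component chosen to be faulty.
            choices = [sorted(f for f in domains[n] if f != OK) for n in faulty]
            for failures in product(*choices):
                fs = {n: OK for n in names}
                fs.update(zip(faulty, failures))
                yield fs


pipeline_domains = {
    "S": {OK}, "V": {OK}, "A": {OK},
    "F1": {OK, "valFail"}, "F2": {OK, "valFail"}, "F3": {OK, "valFail"},
    "G1": {OK, "valFail"}, "G2": {OK, "valFail"}, "G3": {OK, "valFail"},
}
single_fault = list(scenarios_with_at_most(pipeline_domains, 1))
everything = list(scenarios_with_at_most(pipeline_domains, len(pipeline_domains)))
```

For these assumed domains the restriction reduces the number of fixed-point computations from 64 (every combination of the six failure-prone components) to 7 (the failure-free scenario plus one scenario per failure-prone component).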

3.2  Motivation  for  Changes 

This  section  discusses  some  limitations  of  the  analysis  method  just  described. 

3.2.1  Expressiveness 

In systems where the appropriate input histories are unambiguous and structurally-unaffected,
a predicate in the style of (3.17) seems to capture (albeit awkwardly) the intended meaning,
i.e., that the inputs to the actuator are unaffected by failures. However, even when the
system is fault-tolerant, using input-output functions that make the appropriate input histories
unambiguous might be awkward, and (worse) the appropriate input histories might not be
structurally-unaffected. We discuss these two issues in turn.

Unambiguous 

For concrete systems comprising only determinate components, it is always possible to
construct an abstract model of the system such that the input-output functions produce only
unambiguous ms-atoms. Roughly, this can be done by introducing constant symbols for
all of the functions computed by the components, including boolean functions used in the
guards of conditionals and loops, as well as functions used to compute output values. For
example, consider a process that sends $\phi(x)$ to $A_1$ if the input value $x$ satisfies some condition,
and sends $\phi(x)$ to $A_2$ otherwise. An input-output function representing this process might
produce on input $X$ an output history in which the messages sent to $A_1$ are represented by
$F(X)^{M_1(X):?}$ and the messages sent to $A_2$ are represented by $F(X)^{M_2(X):?}$. This
representation is unambiguous. Note that $M_1$ and $M_2$ represent the predicate in the conditional, and
$F$ represents $\phi$. If we “simplified” the input-output function by omitting (say) $M_1$, then the
output to $A_1$ would be simply $F(X)^?$, which is ambiguous.

Abstract systems with unambiguous posets can be constructed even if the system contains
non-determinate components. This requires introducing local variables to represent explicitly
all choices made by the components. For example, consider a process first that forwards
its first input to $A$. Suppose first receives $X_1$ from $S_1$ and $X_2$ from $S_2$. A natural but
ambiguous representation of its output to $A$ is the poset $\langle \{\langle 1, \{F(X_1), F(X_2)\}, 0 \rangle\}, \emptyset \rangle$. One
unambiguous representation is the poset $\langle \{\ell_1, \ell_2\}, \{\langle \ell_1, \ell_2 \rangle\} \rangle$, where $\ell_i = F(X_i)^{Y_i:?}$. Note
that the local variables $Y_1$ and $Y_2$ represent the outcome of the non-determinate choice. The
same technique of introducing local variables applies to internal non-determinism as well as
non-strictness.

In summary, constructing input-output functions that produce unambiguous outputs
sometimes requires introducing additional symbols in their outputs. Furthermore, the
outputs of an input-output function often contain the symbols that appear in its inputs, so when
computing the run for a system, the symbols introduced in the outputs of one input-output
function are often propagated by other input-output functions. Thus, a symbol introduced
to make an output unambiguous may eventually appear in many ms-atoms in a run.
Introducing and propagating these symbols makes computation of runs more expensive and
clutters the runs, making them harder to read (reading runs is sometimes useful, e.g., to
see why a system does not satisfy its fault-tolerance requirement). Since often we care only
about the sensitivity of values to changes caused by failures, a more concise representation
can be achieved by specifying this information directly, rather than encoding it in additional
symbols.

Structurally-Unaffected 

We give two examples of concrete systems for which it is impossible to find input-output
functions such that the input histories of the actuators are structurally-unaffected. Roughly
speaking, such input-output functions do not exist if non-determinate components see any
effects of failures. Both examples can be analyzed using the perturbational framework, which
is described in Sections 3.4 and 3.5.

Impossibility Example 1. Consider a system comprising a source $S$, processing
components $F_1$–$F_3$ that perform some triplicated computation, a voter $V$, an unreplicated
processing component $G$, which is assumed not to fail, and an actuator $A$. The output of the
voter is sent to both $A$ and $G$, and the output of $G$ is sent to $A$. The failure-free behavior
of the system is represented by the run in Figure 3.1. Variable $Y$ is local to $G$, allowing
for some non-determinacy or non-strictness of the component. Of course, the physical
component being modeled by $G$ may really be determinate but with behavior that depends on
aspects of the system that have been abstracted from, such as timing conditions, load, or
physical environment. We model such components as being non-determinate, and we use
local variables to represent the values they produce.

Figure 3.1: Impossibility example 1: failure-free behavior.

Consider a failure scenario in which $F_1$ suffers a failure that causes it to send an arbitrary
value to the voter and to $G$. The behavior of the system in that case is represented by the
run in Figure 3.2. As in Figure 1.1, faulty components are distinguished in figures by dots on
their circumference. Note that the input history of $A$ is not structurally-unaffected, because
it contains variable $Y$, which is local to $G$, and the inputs to $G$ have changed. In particular,
if the value represented by $Y$ can be affected by the arbitrary input to $G$ from $F_1$, then the
system is not fault-tolerant; if it cannot be affected (e.g., because $G$ ignores inputs from all
processes except $V$), then the system is fault-tolerant. There is no way to express in this
framework whether or not the value of $Y$ is sensitive to the new input, because there is
no way to express correlations between a component’s inputs and non-deterministic choices
made by that component.

Figure 3.2: Impossibility example 1: faulty run.

Impossibility Example 2. The above example involves a failure that causes a message
to be sent between components that don’t communicate in failure-free executions. This next
example does not involve such communication. Consider a system with a source $S$ that
sends values $X_1$ and $X_2$ along separate channels $C_1$ and $C_2$, respectively, to a component
$G$, which, for each $i$, processes values received from $C_i$ and sends the results to actuator
$A_i$. Note that component $G$ happens to be non-deterministic, as in the previous example;
variables $Y_1$ and $Y_2$ are local to $G$. The failure-free behavior of the system is represented by
the run in Figure 3.3.


Figure  3.3:  Impossibility  example  2:  failure-free  behavior. 


Now, suppose each channel might fail, causing the channel to corrupt transmitted values.
The fault-tolerance requirement is that if $C_1$ fails, then the input of $A_2$ is unaffected, and
likewise for $C_2$ and $A_1$. The run shown in Figure 3.4 represents the behavior of the system
when $C_1$ fails. Note that the input history of $A_2$ is not structurally-unaffected, because it
contains variable $Y_2$, which is local to $G$, and the inputs to $G$ have changed. If $G$ processes
inputs from $C_1$ and $C_2$ independently, then the value of $Y_2$ is unaffected by the faulty input
from $C_1$ (and likewise for $Y_1$ and $C_2$), and the system is fault-tolerant; otherwise, it is not.
However, in the framework of Section 3.1, there is no way to indicate whether or not the
value of $Y_2$ is affected by the faulty input.

Figure 3.4: Impossibility example 2: faulty run.

Incidentally, this example can also be used to illustrate that the condition

\[ \mathit{run}_F(nf)(fs)(A_2) = \mathit{run}_F(nf)(fs_{OK})(A_2) \]

is unnecessarily restrictive: when the input-output function for $G$ sees the faulty value
from $C_1$, it could use $Y_2'$ instead of $Y_2$ in its output to $A_2$, even if the represented value is
unchanged. In that case, $\mathit{run}_F(nf)(fs)(A_2) \neq \mathit{run}_F(nf)(fs_{OK})(A_2)$, so the abstract input to
$A_2$ is not structurally-unaffected.

3.2.2 Non-Trivial Relationships between Original and Perturbed Values

We now consider a different axis along which the method described in Section 3.1 is not
expressive enough. If only failures that corrupt values arbitrarily are considered, as in
all  of  the  above  examples,  then  the  corresponding  perturbations  are  either  “unchanged”  or 
“arbitrarily  changed”.  This  single  bit  of  information  can  be  encoded  in  symbolic  values, 
by  using  the  same  or  different  variable  names,  respectively,  to  represent  those  values.  The 
method  described  in  Section  3.1  uses  such  an  encoding.  However,  non-trivial  relationships 
between  values  in  the  failure-free  and  faulty  computations  cannot  be  expressed  with  that 
method.  We  give  two  examples  that  involve  such  relationships. 

Example  1:  ECC.  Error-correcting  codes  (ECCs)  are  widely  used  when  transmitting  data 
over  unreliable  channels  or  storing  data  on  unreliable  media.  Our  goal  in  this  example  is  to 
characterize  abstractly  the  fault-tolerance  provided  by  an  ECC,  so  that  we  can  analyze  larger 
systems  that  use  ECCs  together  with  other  fault-tolerance  mechanisms.  To  illustrate  the 
importance  of  explicit  perturbations  for  this  purpose,  we  consider  a  simple  system  comprising 
only  a  source  S,  an  encoder  E,  an  unreliable  channel  C.  a  decoder  I),  and  a  receiver  R. 
Constant  function  F  represents  the  ECC  function.  The  failure-free  behavior  of  the  system  is 
represented  by  the  run  in  Figure  3.5.  The  abstract  value  W  is  a  set  of  bit-vectors  (“words”). 

Figure 3.5: Failure-free behavior of system with ECC.

Suppose the channel may fail by corrupting at most k bits of the transmitted value, and
that a k-bit ECC is used. The fault-tolerance property of a k-bit ECC is: if the input value
to the decoder differs in at most k bits from the input value to the encoder, then the decoder
outputs the value given to the encoder. The fault-tolerance requirement for this system is
that the input to receiver R is unaffected by failure of channel C.

To  analyze  this  system,  we  need  to  find  an  input-output  function  for  channel  C  that 
expresses  that  at  most  k  bits  are  corrupted  in  each  output.  For  concreteness,  consider  what 
the  output  of  that  input-output  function  should  be  on  an  input  Y.  The  most  accurate 
symbolic value is something like corrupt(Y, Z), where Z is a local variable of the channel
(note that the faulty channel is non-deterministic). We would like the abstract value in the
output of C to denote the set of words that differ from Y in at most k bits; if this set cannot
be expressed, then we will not be able to conclude that the decoder outputs the original
value Y.

Indeed,  this  set  cannot  be  expressed  as  an  abstract  value  in  the  framework  of  Section 
3.1,  since  it  would  require  an  abstract  value  whose  meaning  depends  on  the  interpretation 
of  a  variable  (namely,  F).  However,  this  set  can  be  expressed  directly  in  the  perturbational 
framework  that  will  be  introduced  in  Section  3.4. 

The  perturbational  framework  represents  one  approach  to  dealing  with  this  and  similar 
systems.  An  alternative  approach  is  to  extend  the  definition  of  abstract  values  to  allow 
abstract  values  whose  meanings  depend  on  the  interpretation  of  symbolic  values.  Such 
abstract  values  are  considered  in  Chapter  5. 
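
To make the set in question concrete, the following is a minimal sketch in Python. It enumerates the words that differ from a given word in at most k bits, and checks the k-bit ECC property for one particular code. The (2k+1)-fold repetition code and the names `hamming_ball`, `encode`, `decode`, and `tolerates` are assumptions made for illustration only; the text does not fix a particular ECC.

```python
from itertools import combinations

def hamming_ball(word, k):
    """Return every bit-tuple that differs from `word` in at most k positions."""
    ball = set()
    for r in range(k + 1):
        for positions in combinations(range(len(word)), r):
            flipped = list(word)
            for p in positions:
                flipped[p] ^= 1
            ball.add(tuple(flipped))
    return ball

def encode(bit, k):
    """Hypothetical ECC: repeat a single bit 2k+1 times."""
    return (bit,) * (2 * k + 1)

def decode(word):
    """Majority decoding for the repetition code."""
    return 1 if sum(word) > len(word) // 2 else 0

def tolerates(k):
    """Check that decoding recovers the source bit after any corruption of <= k bits."""
    return all(decode(w) == bit
               for bit in (0, 1)
               for w in hamming_ball(encode(bit, k), k))
```

Here `hamming_ball(encode(Y, k), k)` plays the role of the set that the channel's abstract output should denote. Its contents depend on the concrete value of Y, which is exactly what the abstract values of Section 3.1 cannot express.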

Example 2: Median. Non-trivial relationships between original and perturbed values
also arise in replicated systems in which different replicas produce approximately equal (but
not necessarily identical) values. For example, consider a system with replicated sensors
that, when non-faulty, produce values that are within $\epsilon$ of some physical quantity.
Suppose sensors fail by producing arbitrary values. We model this system using a component
E, representing the environment, that sends the actual value X of the measured physical
quantity to components $S_i$, representing the sensors. We take the failure-free behavior of
the system to correspond to the ideal case in which each sensor outputs (actual value) X
to component M. Component M sends the median of its inputs to component F, which
computes some control function and sends the result to an actuator A. The behavior of this
system is represented by the run in Figure 3.6.


Figure  3.6:  Idealized  behavior  of  system  with  median. 


Consider failure scenarios in which a majority of the sensors are non-faulty and produce
values within $\epsilon$ of their input value X, but a minority fail by producing arbitrary values. It
is natural to model both of these deviations from the ideal behavior as perturbations. The
analysis would then be based on how the median propagates perturbations: if the inputs
are originally all equal (to X), and a majority of the inputs change by at most $\epsilon$, then the
output changes by at most $\epsilon$. However, this analysis requires the ability to express that the
perturbed inputs to M are within $\epsilon$ of (the concrete value represented by the variable) X.
As in the previous example, there is no way to express this set in the framework of Section
3.1.
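
The propagation property of the median stated above can be checked by brute force on small cases. The sketch below is illustrative only: the function names and the finite set of candidate readings are assumptions, and the check is an exhaustive test over that finite set, not a proof.

```python
from itertools import product
from statistics import median

def majority_within(x, readings, eps):
    """True if a strict majority of the readings lie within eps of x."""
    close = sum(1 for r in readings if abs(r - x) <= eps)
    return close > len(readings) / 2

def median_within(x, readings, eps):
    """True if the median of the readings lies within eps of x."""
    return abs(median(readings) - x) <= eps

def check_propagation(x, eps, candidates, n):
    """Exhaustively check: majority within eps of x implies median within eps of x."""
    for readings in product(candidates, repeat=n):
        if majority_within(x, readings, eps) and not median_within(x, readings, eps):
            return False
    return True
```

For example, `check_propagation(10, 1, [0, 9, 10, 11, 100], 3)` considers every assignment of the five candidate readings to three sensors. The hypothesis `majority_within` refers to the concrete value x, which is the relationship the framework of Section 3.1 has no way to state.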

3.3  Concrete  Model  with  Failures  and  Correlations 

This  section  describes  the  concrete  model  that  is  later  used  in  Section  3.5.2  to  give  a  semantics 
for  our  perturbational  framework.  To  see  why  representation  (3.1)  of  processes  is  inadequate 
for  this  purpose,  consider  a  component  that  non-deterministically  selects  and  outputs  a  single 
number.  Suppose  this  component  can  fail  only  by  crashing.  Thus,  at  worst  a  failure  causes 
the  component  to  output  nothing;  a  failure  cannot  cause  the  component  to  output  a  different 
value than in the failure-free computation. Defining a process $p \in \mathit{Process}_F$ that describes
this component and reflects this fact is problematic. The root of the problem is that the
definition of $\mathit{Process}_F$ forces one to describe completely separately the possible behaviors in
the cases fail = OK and fail = crash. Recall that the interpretation of fail = crash is that
a crash occurs at some unspecified time during a computation; thus, the possible behaviors
of p(crash) are to: (1) non-deterministically select and output any number and then crash,
or (2) crash (before doing anything else) and output nothing.

If  one  tries  to  determine  solely  from  information  in  p  how  the  component’s  behavior  might 
change  as  a  consequence  of  a  crash  failure,  in  the  absence  of  information  about  correlations 
between the behaviors in p(OK) and p(crash), the only safe (conservative) assumption is that
each possible behavior of p(OK) may be changed by a crash into any possible behavior of
p(crash). With this conservative approximation, it would appear that a crash could change
the value output by p, while in fact, it cannot.

The root of the inaccuracy is that elements of $\mathit{Process}_F$ do not describe correlations
between  the  possible  behaviors  of  a  component  in  the  presence  and  absence  of  failures. 
Continuing  the  above  example,  we  would  like  to  express  that  a  behavior  in  which  the  process 
outputs  a  number  n  is  changed  by  occurrence  of  a  crash  only  into:  (1)  a  behavior  in  which 
the  process  outputs  n  and  then  crashes,  or  (2)  a  behavior  in  which  the  process  crashes  (before 
doing  anything  else)  and  outputs  nothing.  So,  in  this  example,  the  behavior  in  which  the 
non-faulty  process  outputs  n  is  related  (by  a  crash)  to  the  behaviors  in  which  the  faulty 
process outputs n or outputs nothing, but is not related to the behavior in which the faulty
process outputs n', where $n' \neq n$.
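
The difference between the two views can be seen on a toy version of this component. In the sketch below, a behavior is modelled simply as the tuple of values output, and the number range is cut down to three values; both choices, and all names, are assumptions made for illustration.

```python
NUMBERS = range(3)  # assumption: a tiny stand-in for "any number"

# A behavior is the tuple of output values: (n,) or () for "outputs nothing".
ok_behaviors = {(n,) for n in NUMBERS}
crash_behaviors = {(n,) for n in NUMBERS} | {()}

# Without correlations, the only safe assumption is that any failure-free
# behavior may be changed into any behavior of the crashed component.
uncorrelated = {(b, b2) for b in ok_behaviors for b2 in crash_behaviors}

# With correlations, output n is either kept (then crash) or lost entirely.
correlated = {(b, b2) for b in ok_behaviors for b2 in (b, ())}

def value_can_change(pairs):
    """True if some pair shows a faulty output that differs from the original one."""
    return any(len(b2) > 0 and b2 != b for b, b2 in pairs)
```

Under the uncorrelated approximation `value_can_change` holds, while under the correlated relation it does not, which matches the informal argument above. Both relations have the same set of first components, namely `ok_behaviors`.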

To reflect such correlations, we now model a process as a function of type
$\mathit{Fail} \to \mathit{Set}(\mathit{IRProcess} \times \mathit{IRProcess})$. The interpretation of $(\mathit{irp}, \mathit{irp}') \in p(\mathit{fail})$ is that the process
can behave like irp in the absence of failures, and that failure fail can cause p's behavior to
change from that of irp to that of irp'. We impose two sanity conditions. The first condition,
$\mathit{origIndep}_C$, requires that the set of failure-free behaviors of the process be independent of
the failure:

\[ \mathit{origIndep}_C(p) = (\forall \mathit{fail}_1, \mathit{fail}_2 \in \mathrm{dom}(p) : \bar{\pi}_1(p(\mathit{fail}_1)) = \bar{\pi}_1(p(\mathit{fail}_2))), \tag{3.18} \]

where $\bar{\pi}_1$ is the pointwise extension of $\pi_1$ from tuples to sets of tuples.

An input to a pair $(\mathit{irp}, \mathit{irp}') \in p(\mathit{fail})$ is a pair $(\sigma, \sigma')$ of concrete histories, where $\sigma$ is the
input to irp in a failure-free execution of the system, and $\sigma'$ is the input to irp' in a faulty
execution. A pair $(\mathit{irp}, \mathit{irp}') \in p(\mathit{fail})$ of input-restricted processes is enabled only if both
processes are enabled on their respective inputs; this convention is reflected in the definition
of $\mathit{cruns}_{FC}$ below. Thus, to ensure that the set of failure-free behaviors is independent of the
failure mode, we must require that enabledness of each input-restricted process in $\bar{\pi}_1(p(\mathit{fail}))$
is independent of the component's inputs in the faulty computation. This is ensured by the
second sanity condition, consistent, defined by

\[
\begin{aligned}
\mathit{consistent}(p) = (&\forall \mathit{fail} \in \mathrm{dom}(p) : (\forall (\sigma, \sigma') \in \mathit{Chain}(\mathit{CHist}) \times \mathit{Chain}(\mathit{CHist}) : \\
&(\forall \mathit{irp} \in \bar{\pi}_1(p(\mathit{fail})) : \mathit{enabled}(\mathit{irp}, \sigma) \Rightarrow (\exists \mathit{irp}' \in \mathit{IRProcess} : \\
&(\mathit{irp}, \mathit{irp}') \in p(\mathit{fail}) \wedge \mathit{enabled}(\mathit{irp}', \sigma'))))).
\end{aligned}
\tag{3.19}
\]

Thus, processes are elements of

\[ \mathit{Process}_{FC} = \{p \in \mathit{Fail} \to \mathit{Set}(\mathit{IRProcess} \times \mathit{IRProcess}) \mid \mathit{origIndep}_C(p) \wedge \mathit{consistent}(p)\}. \tag{3.20} \]

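The possible behaviors of a concrete system $\mathit{np} \in \mathit{Name} \to \mathit{Process}_{FC}$ in failure scenario
$\mathit{fs} \in \mathit{FS}(\mathit{np})$ are represented by a set $\mathit{cruns}_{FC}(\mathit{np})(\mathit{fs}) \subseteq \mathit{CRun} \times \mathit{CRun}$. The interpretation
of $(\mathit{cr}, \mathit{cr}') \in \mathit{cruns}_{FC}(\mathit{np})(\mathit{fs})$ is that cr is a possible failure-free run of the system and that
the failures in fs can cause the system's behavior to change from cr to cr'. Formally,

\[
\begin{aligned}
\mathit{cruns}_{FC}(\mathit{np})(\mathit{fs}) = \{&(\mathit{cr}_1, \mathit{cr}_2) \in \mathit{CRun} \times \mathit{CRun} \mid \\
&(\exists h \in \mathit{Name} \to \mathit{IRProcess} \times \mathit{IRProcess} : \\
&\quad (\forall x \in \mathit{Name} : h(x) \in \mathit{np}(x)(\mathit{fs}(x))) \\
&\quad \wedge\ (\forall i \in \{1, 2\} : \mathit{cr}_i \in \mathit{cruns}(\lambda x : \mathit{Name}.\ \{\pi_i(h(x))\})))\}.
\end{aligned}
\tag{3.21}
\]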

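The set of failure-free runs of the system is given by $\bar{\pi}_1(\mathit{cruns}_{FC}(\mathit{np})(\mathit{fs}))$. Conditions
$\mathit{origIndep}_C$ and consistent together ensure that the set of failure-free runs is independent of
the failure scenario:

\[
\begin{aligned}
(&\forall \mathit{np} \in \mathit{Name} \to \mathit{Process}_{FC} : (\forall \mathit{fs}_1, \mathit{fs}_2 \in \mathit{FS}(\mathit{np}) : \\
&\bar{\pi}_1(\mathit{cruns}_{FC}(\mathit{np})(\mathit{fs}_1)) = \bar{\pi}_1(\mathit{cruns}_{FC}(\mathit{np})(\mathit{fs}_2)))).
\end{aligned}
\tag{3.22}
\]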

3.3.1  Running  Example 

Definitions of $\mathit{CSrc}_{FC}$, $\mathit{CVoter}_{FC}$, and $\mathit{CAct}_{FC}$. We do not consider failures of the source
S, voter V, or actuator A, so the elements of $\mathit{Process}_{FC}$ that describe those components are
closely related to the processes $\mathit{CSrc}, \mathit{CVoter}, \mathit{CAct} \in \mathit{Process}$ defined in Section 2.1.3. The
function $\mathit{nonfaulty}_{FC} \in \mathit{Process} \to \mathit{Process}_{FC}$ converts an element of Process into a failure-free
element of $\mathit{Process}_{FC}$:

\[ \mathit{nonfaulty}_{FC}(p) = (\lambda \mathit{fail} : \{\mathit{OK}\}.\ \bigcup_{\mathit{irp} \in p} \{(\mathit{irp}, \mathit{irp})\}). \tag{3.23} \]
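So, for $\mathit{dests}, \mathit{srcs} \in \mathit{Set}(\mathit{Name})$ and $\mathit{dest} \in \mathit{Name}$,

\[ \mathit{CSrc}_{FC}(\mathit{dests}) = \mathit{nonfaulty}_{FC}(\mathit{CSrc}(\mathit{dests})) \tag{3.24} \]

\[ \mathit{CVoter}_{FC}(\mathit{srcs}, \mathit{dest}) = \mathit{nonfaulty}_{FC}(\mathit{CVoter}(\mathit{srcs}, \mathit{dest})) \tag{3.25} \]

\[ \mathit{CAct}_{FC} = \mathit{nonfaulty}_{FC}(\mathit{CAct}). \tag{3.26} \]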

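Definition of $\mathit{CComp}_{FC}$. For processors $F_1$-$F_3$ and $G_1$-$G_3$, we assume the same failure
modes as in Section 3.1.3. These components are represented by appropriate instances of
$\mathit{CComp}_{FC}$, where for $\mathit{src}, \mathit{dest} \in \mathit{Name}$ and $\phi \in \mathbb{N} \to \mathbb{N}$,

\[
\begin{aligned}
\mathit{CComp}_{FC}(\mathit{src}, \mathit{dest}, \phi) = (&\lambda \mathit{fail} : \{\mathit{OK}, \mathit{valFail}\}. \\
&\mathbf{if}\ \mathit{fail} = \mathit{OK}\ \mathbf{then}\ \mathit{CComp}(\mathit{src}, \mathit{dest}, \phi) \times \mathit{CComp}(\mathit{src}, \mathit{dest}, \phi) \\
&\mathbf{else}\ \mathit{CComp}(\mathit{src}, \mathit{dest}, \phi) \times \mathit{CValFail}(\mathit{CVal}, \{\mathit{dest}\})),
\end{aligned}
\tag{3.28}
\]

where CComp and CValFail are defined by (2.18) and (3.11), respectively. The clause for
fail = OK has been simplified using the fact that $\mathit{CComp}(\mathit{src}, \mathit{dest}, \phi)$ is a singleton set, so
there is no need to explicitly restrict to pairs (cr, cr') such that cr = cr'.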

The concrete runs of this system can be computed using $\mathit{cruns}_{FC}$. Let $\mathit{np}_{FC}$ be the
obvious mapping from Name to $\mathit{Process}_{FC}$: $\mathit{np}_{FC}(S) = \mathit{CSrc}_{FC}(\{F_1, F_2, F_3\})$, etc. As before,
let $\mathit{fs}_1$ be the failure scenario in which only $F_1$ fails. It is easy to check that

cruns FC {npFC){fs OK)  =  (J  (crre{i),  crre{i)) 

ie  N 

cruris Fc {np rFC ) {fs x )  =  |J  (crre{i),  crr^{i,  cv)). 

iEN,evECVal 

3.4 Perturbational Framework: Representation of Runs

The  set  of  possible  behaviors  of  a  system  in  a  particular  failure-scenario  is  represented  by 
a  single  run.  That  run  represents  the  system’s  failure-free  behaviors  as  well  as  its  possible 
behaviors  in  the  given  failure  scenario.  For  example,  in  Figure  1.1,  the  failure-free  behavior 
is  described  by  the  original  parts  of  the  perturbed  ms-atoms,  while  the  new  ms-atoms  and 
the  perturbations  in  the  perturbed  ms-atoms  together  indicate  how  the  failure-free  behavior 
is  changed  by  failures. 
More generally, we introduce in this section a new set $\mathit{Run}_{FC}$ of runs extended with
perturbations and new ms-atoms. The meaning of an element of $\mathit{Run}_{FC}$ is a set of pairs of
concrete runs. For a run r computed for a system in failure scenario fs, the interpretation
of a pair (cr, cr') in the meaning of r is that the system's behavior may change from cr to
cr' as a consequence of the failures in fs.

The definition of $\mathit{Run}_{FC}$ is analogous to definition (2.29) of Run, except that the underlying
set of ms-atoms is extended with perturbations and new ms-atoms. The extended set
of ms-atoms is

\[ L_{FC} = L_{per} \cup L_{new}, \tag{3.29} \]

where the sets $L_{per}$ of perturbed ms-atoms and $L_{new}$ of new ms-atoms are defined by

\[ L_{per} = \mathit{Mul} \times \mathit{Val} \times \Delta\mathit{Mul} \times \Delta\mathit{Val} \times \mathit{Tag} \tag{3.30} \]

\[ L_{new} = \mathit{Mul} \times \mathit{Val} \times \mathit{Tag}, \tag{3.31} \]

where $\Delta\mathit{Mul}$ and $\Delta\mathit{Val}$ are perturbations to the multiplicity and value, respectively.
Perturbations are represented similarly to values: an abstract part describing the possible changes
in value, and a symbolic part representing the perturbed value itself, i.e., the concrete value
in the faulty execution. Possible changes are described by binary relations over CVal: if the
relation relates cv to cv', then the concrete value can change from cv to cv'. Thus, possible
changes in value are represented by elements of a new set $\Delta\mathit{AVal}$, and each element of $\Delta\mathit{AVal}$
is interpreted as a binary relation over CVal. Formally, the interpretation of $\Delta\mathit{AVal}$ is given
by a function $[\![\cdot]\!]_{\Delta AVal} \in \Delta\mathit{AVal} \to \mathit{Set}(\mathit{CVal} \times \mathit{CVal})$.

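For convenience, we assume that $\mathit{id} \in \Delta\mathit{AVal}$ denotes the identity relation:

\[ [\![\mathit{id}]\!]_{\Delta AVal} = \{(x, x') \in \mathit{CVal} \times \mathit{CVal} \mid x = x'\} \tag{3.32} \]

and that the "top" element $\top_{\Delta V} \in \Delta\mathit{AVal}$ denotes an arbitrary change:

\[ [\![\top_{\Delta V}]\!]_{\Delta AVal} = \mathit{CVal} \times \mathit{CVal}. \tag{3.33} \]

For any $a \in \mathit{AVal}$, we define $a^{\Delta} \in \Delta\mathit{AVal}$ by

\[ [\![a^{\Delta}]\!]_{\Delta AVal} = [\![a]\!]_{AVal} \times [\![a]\!]_{AVal}. \tag{3.34} \]

The notational conventions introduced for Val are also used for $\Delta\mathit{Val}$; for example, we might
write $\top_{\Delta V}$ to denote $\{(\_, \top_{\Delta V})\} \in \Delta\mathit{Val}$. In the ms-atom on edge $(F_1, G_1)$ in Figure 1.1, the
perturbation $\top_{\Delta V}$ indicates that the data sent from $F_1$ to $G_1$ may change arbitrarily when
$F_1$ fails.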
By analogy with definition (2.39) of AMul, we take $\Delta\mathit{AMul}$ to be an appropriate subset
of $\Delta\mathit{AVal}$, namely, elements that relate natural numbers only to natural numbers:

\[ \Delta\mathit{AMul} = \{\delta a \in \Delta\mathit{AVal} \mid (\forall (x, y) \in [\![\delta a]\!]_{\Delta AVal} : x \in \mathbb{N} \Rightarrow y \in \mathbb{N})\}. \tag{3.35} \]

Here, $\delta a$ is just a bound variable, like a in (2.39); we include $\delta$ in the name only as a reminder
that this variable ranges over some set of perturbations. Note that for $a \in \mathit{AMul}$, $a^{\Delta}$ (defined
by (3.34)) is in $\Delta\mathit{AMul}$. For example, in the ms-atom on edge $(F_1, G_1)$ in Figure 1.1, the
superscript $*^{\Delta} \in \Delta\mathit{AMul}$ indicates that the number of messages sent from $F_1$ to $G_1$ may
change arbitrarily when $F_1$ fails.

The  symbolic  and  abstract  parts  of  a  perturbation  are  aggregated  in  the  same  way  as  the 
symbolic  and  abstract  parts  of  a  value:  by  analogy  with  definitions  (2.31)  and  (2.38)  of  Val 
and  Mul ,  respectively,  we  define 

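\[ \Delta\mathit{Val} = \mathcal{P}_{fin}(\mathit{SVal} \times \Delta\mathit{AVal}) \setminus \{\emptyset\} \tag{3.36} \]

\[ \Delta\mathit{Mul} = \mathcal{P}_{fin}(\mathit{SVal} \times \Delta\mathit{AMul}) \setminus \{\emptyset\}. \tag{3.37} \]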

Histories and runs are defined as before, except over $L_{FC}$ instead of L:

\[ \mathit{Hist}_{FC} = \mathit{Name} \to \mathit{POSet}(L_{FC}) \tag{3.38} \]

\[ \mathit{Run}_{FC} = \mathit{Name} \to \mathit{Hist}_{FC}. \tag{3.39} \]

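Notational Conventions. We sometimes write a ms-atom $(\mathit{mul}, \mathit{val}, 0) \in L_{new}$ as $\mathit{val}^{\mathit{mul}}$.
Similarly, we sometimes write a ms-atom $(\mathit{mul}, \mathit{val}, \delta\mathit{mul}, \delta\mathit{val}, 0) \in L_{per}$ as $\mathit{val}^{\mathit{mul}}[\delta\mathit{val}^{\delta\mathit{mul}}]$.
We sometimes elide a change of (_, id); for example, the ms-atom $(\mathit{mul}, \mathit{val}, \mathit{id}, \mathit{id}, 0) \in L_{per}$
may be written $\mathit{val}^{\mathit{mul}}[\,]$. The empty brackets are retained to distinguish this from the
shorthand for ms-atoms in $L_{new}$.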

To illustrate these definitions, consider the perturbed ms-atom $(X{:}N)^{*}[(X{:}\mathit{id})^{\mathit{id}}]$. The
original part of this ms-atom represents an arbitrary sequence of messages all containing the
same number, represented by X. In the perturbed behavior, the same number of messages
are sent (because the change in multiplicity is the identity relation id) and the messages
all contain the original value (because the perturbation contains the same variable X, and
because the change in value is the identity relation id). In short, the messages represented by
this ms-atom are unchanged. Note that the ms-atom $(X{:}N)^{*}[\mathit{id}^{\mathit{id}}]$ has the same meaning.

As another example, consider the ms-atom $X{:}N[F{:}\top_{\Delta V}]$. The original part of this ms-atom
represents a single message containing a number, represented by X. In the perturbed
behavior, the multiplicity is unchanged (because the change in multiplicity is the identity
relation id, which is elided) but the value changes arbitrarily and is now represented by F.
As a variation on this, consider the ms-atom $X{:}N[X{:}\top_{\Delta V}]$. Although the abstract part of
the perturbation to the value allows an arbitrary change, the symbolic part shows that the
value is still represented by X, so this ms-atom has the same meaning as $X{:}N[X{:}\mathit{id}]$.

As a final example, consider the ms-atom $N[\mathit{id}^{?^{\Delta}}]$. The original part of this ms-atom
represents a single message containing a number. When failures occur, the data in the
message is unchanged (because the change in value is the identity relation id) but the
number of messages might change to zero or remain at one, since $[\![?^{\Delta}]\!]_{\Delta AVal}$ contains the
pairs (1, 0) and (1, 1). In short, there is a possibility that the original behavior is unchanged,
but there is also a possibility that no message is sent.
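
The reading of perturbations as binary relations can be tried out directly on a small finite stand-in for CVal. The sketch below is illustrative only: the four-element universe, the encoding of the multiplicity "?" as the set {0, 1}, and all names are assumptions.

```python
UNIVERSE = [0, 1, 2, 3]  # assumption: a tiny, finite stand-in for CVal

# Interpretations of id and of the "top" perturbation, as in (3.32) and (3.33).
ID = {(x, x) for x in UNIVERSE}
TOP = {(x, y) for x in UNIVERSE for y in UNIVERSE}

def delta(abstract):
    """Interpretation of a^Delta as in (3.34): any member may change into any member."""
    return {(x, y) for x in abstract for y in abstract}

OPTIONAL = {0, 1}  # assumed meaning of the multiplicity '?': zero or one message

def possible_changes(relation, original):
    """Concrete values that `original` may change into under `relation`."""
    return {y for (x, y) in relation if x == original}
```

With these definitions, `possible_changes(delta(OPTIONAL), 1)` evaluates to `{0, 1}`, which corresponds to the last example: a single message either remains or disappears.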

3.4.1  Running  Example 

The replicated pipeline's behavior in failure scenario $\mathit{fs}_{OK}$ (recall that $\mathit{fs}_{OK}$ is defined on
page 50) is represented by almost the same run as in Figure 2.2; the only difference is that
each ms-atom $\mathit{val}^{\mathit{mul}}$ is replaced with $\mathit{val}^{\mathit{mul}}[\,]$. In other words, all of the perturbations are
the identity perturbation.

The system's behavior in failure scenario $\mathit{fs}_1$ is represented by the run in Figure 3.7.


Figure  3.7:  Run  for  running  example  when  component  F\  fails. 


3.5 Perturbational Framework: Representation of Components

Components are represented by input-output functions over the extended type $L_{FC}$ of ms-atoms.
In addition to the familiar sanity requirement of tag-uniformity, we impose a sanity
requirement analogous to $\mathit{origIndep}_C$ in the definition (3.20) of processes. Thus,

\[ \mathit{IOF}_{FC} = \{f \in \mathit{Fail} \to (\mathit{Hist}_{FC} \to \mathit{Hist}_{FC}) \mid \mathit{tagUniform}_{FC}(f) \wedge \mathit{origIndep}(f)\}, \tag{3.40} \]

where $\mathit{tagUniform}_{FC}(f)$ ensures that renaming of tags in the argument of f causes no change
in the output of f except possibly renaming of tags, and origIndep ensures that for all
$\mathit{fail} \in \mathrm{dom}(f)$, the original part of f(fail)'s output (i.e., the part ignoring changes and new
outputs) depends only on the original part of f(fail)'s input. Note that no analogue of
consistent is needed, because input-output functions are not input-restricted: they are total
functions. Formally, the definition of $\mathit{tagUniform}_{FC}$ is definition (2.45) of tagUniform with
Hist replaced with $\mathit{Hist}_{FC}$:

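\[ \mathit{tagUniform}_{FC}(f) = (\forall \mathit{in}_1, \mathit{in}_2 \in \mathit{Hist}_{FC} : \mathit{in}_1 =_{Hist_{FC}} \mathit{in}_2 \Rightarrow f(\mathit{in}_1) =_{Hist_{FC}} f(\mathit{in}_2)), \tag{3.41} \]

where $=_{Hist_{FC}}$ denotes equality of histories up to renaming of tags; the definition of $=_{Hist_{FC}}$
is similar to the definition of $=_{Hist}$.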

The definition of origIndep uses a function $\mathit{orig} \in \mathit{Set}(L_{FC}) \to \mathit{Set}(L)$ that projects the
original part of a set of ms-atoms. Roughly, orig(S) is obtained from S by eliminating the
perturbations from the perturbed ms-atoms and dropping the new ms-atoms entirely. The
only technicality is that retagging may be necessary to avoid "collisions". This is similar to
the retagging needed in the definition of apOp in Section 2.3.2. Here, a collision occurs if the
set S of ms-atoms contains two elements that differ only in their perturbations; eliminating
the perturbations would cause those two elements to appear identical. To avoid this collision
of identities, we would change the tag in one of those ms-atoms. Using orig, the definition
of origIndep is straightforward:

\[
\begin{aligned}
\mathit{origIndep}(f) = (&\forall \mathit{fail} \in \mathrm{dom}(f) : (\forall \mathit{in}_1, \mathit{in}_2 \in \mathit{Hist}_{FC} : \\
&\overline{\mathit{orig}}(\mathit{in}_1) =_{Hist} \overline{\mathit{orig}}(\mathit{in}_2) \\
&\Rightarrow \overline{\mathit{orig}}(f(\mathit{fail})(\mathit{in}_1)) =_{Hist} \overline{\mathit{orig}}(f(\mathit{fail})(\mathit{in}_2)))),
\end{aligned}
\tag{3.42}
\]

where $\overline{\mathit{orig}} \in \mathit{Hist}_{FC} \to \mathit{Hist}$ is the extension of orig from $\mathit{Set}(L_{FC})$ to $\mathit{Hist}_{FC}$.
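
The projection orig and the retagging it may require can be sketched as follows. The encoding is an assumption made for illustration: perturbed ms-atoms are 5-tuples `(mul, val, dmul, dval, tag)`, new ms-atoms are 3-tuples `(mul, val, tag)`, tags are integers, and a collision is resolved by incrementing the tag. The text leaves the retagging scheme unspecified.

```python
def orig(atoms):
    """Project the original part of a set of ms-atoms.

    Perturbations are stripped from perturbed atoms (5-tuples), new atoms
    (3-tuples) are dropped, and tags are bumped when two stripped atoms
    would otherwise coincide.
    """
    result = set()
    perturbed = [a for a in atoms if len(a) == 5]
    # Sorting only makes the (arbitrary) choice of which atom is retagged deterministic.
    for mul, val, _dmul, _dval, tag in sorted(perturbed, key=repr):
        while (mul, val, tag) in result:
            tag += 1  # retag to keep the two original atoms distinct
        result.add((mul, val, tag))
    return result
```

For instance, two perturbed atoms that differ only in their value perturbation project to two distinct original atoms with different tags, while a new atom in the same set contributes nothing.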

For future reference, we remark that there is a family of bijections naturally associated
with orig. In particular, for $S \in \mathit{Set}(L_{FC})$, borig(S) is a bijection from $S \cap L_{per}$ to orig(S)
that preserves values and multiplicities; in other words, borig(S) satisfies

\[
\begin{aligned}
&\mathit{borig}(S) \in (S \cap L_{per}) \leftrightarrow \mathit{orig}(S) \\
&\wedge\ (\forall \ell \in S \cap L_{per} : \pi_1(\mathit{borig}(S)(\ell)) = \pi_1(\ell) \wedge \pi_2(\mathit{borig}(S)(\ell)) = \pi_2(\ell)),
\end{aligned}
\tag{3.43}
\]

where for sets S and T, $S \leftrightarrow T$ is the set of bijections from S to T. This family of bijections
will be convenient for formulating the semantics in Section 3.5.2.

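The behavior of a system $\mathit{nf} \in \mathit{Name} \to \mathit{Process}_{FC}$ is represented by the run
$\mathit{run}_{FC}(\mathit{nf})(\mathit{fs}) = \mathit{lfp}(\mathit{step}_F(\mathit{nf}, \mathit{fs}))$, if it exists. The equality $=_{Run_{FC}}$ on $\mathit{Run}_{FC}$ is the pointwise extension of
the equality $=_{Hist_{FC}}$ on $\mathit{Hist}_{FC}$. Thus, a fixed-point of $\mathit{step}_F(\mathit{nf}, \mathit{fs})$ is an element r of $\mathit{Run}_{FC}$
satisfying $\mathit{step}_F(\mathit{nf}, \mathit{fs})(r) =_{Run_{FC}} r$.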

3.5.1  Running  Example. 

The  fault-tolerance  requirement  for  the  replicated  pipeline  is  the  same  as  in  Section  3.1.3. 
In  the  perturbational  model,  the  fault-tolerance  requirement  still  has  the  form  (3.17),  but 
bo is defined in terms of perturbations. In particular, bo(fs, r) = unchanged(r(A)), where
the  predicate  unchanged  on  histories  is  the  pointwise  extension  of  the  predicate  unchanged 
on  posets  of  ms-atoms.  Roughly,  the  predicate  unchanged  on  posets  of  ms-atoms  asserts 
that  the  poset  is  totally-ordered  and  contains  no  new  ms-atoms,  and  that  the  perturbations 
to  the  value  and  multiplicity  in  each  perturbed  ms-atom  in  the  history  are  unchanged.  A 
perturbation  is  unchanged  if  the  abstract  part  of  the  perturbation  is  id  and  the  symbolic 
part  of  the  perturbation  does  not  “force”  the  value  to  change,  i.e.,  the  symbolic  value  either 
contains  a  wildcard  or  contains  all  of  the  symbolic  values  in  the  original  value.  The  condition 
that  the  poset  be  totally  ordered  assures  that  messages  occur  in  the  same  order  in  the  original 
and  perturbed  computations.  Formally, 

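\[
\begin{aligned}
\mathit{unchanged}((S, \prec)) =\ &\mathit{totalOrd}((S, \prec)) \\
\wedge\ &(S \cap L_{new}) = \emptyset \\
\wedge\ &(\forall (\mathit{mul}, \mathit{val}, \delta\mathit{mul}, \delta\mathit{val}, \mathit{tag}) \in (S \cap L_{per}) : \\
&\quad \mathit{unchangedVal}(\mathit{val}, \delta\mathit{val}) \wedge \mathit{unchangedVal}(\mathit{mul}, \delta\mathit{mul}))
\end{aligned}
\tag{3.44}
\]

where

\[ \mathit{totalOrd}((S, \prec)) = (\forall x, y \in S : x = y \vee x \prec y \vee y \prec x) \tag{3.45} \]

and for $\mathit{val} \in \mathit{Val}$ and $\delta\mathit{val} \in \Delta\mathit{Val}$,

\[
\begin{aligned}
\mathit{unchangedVal}(\mathit{val}, \delta\mathit{val}) =\ &\bar{\pi}_2(\delta\mathit{val}) = \{\mathit{id}\} \\
\wedge\ &(\_ \in \bar{\pi}_1(\delta\mathit{val}) \vee \bar{\pi}_1(\mathit{val}) \subseteq \bar{\pi}_1(\delta\mathit{val})).
\end{aligned}
\tag{3.46}
\]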

The  meaning  of  unchanged  is  discussed  further  in  Section  3.5.2. 
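
The predicates in (3.44) to (3.46) translate almost directly into executable form. The sketch below assumes a particular encoding that is not taken from the text: values and perturbations are frozensets of `(symbolic, abstract)` pairs, the wildcard is the string `"_"`, the identity change is the string `"id"`, perturbed ms-atoms are 5-tuples, new ms-atoms are 3-tuples, and the order is a set of pairs.

```python
WILD = "_"  # assumed encoding of the wildcard symbolic value

def unchanged_val(val, dval):
    """(3.46): the change is the identity and the symbolic part does not force a change."""
    symbols = {s for s, _ in val}
    dsymbols = {s for s, _ in dval}
    changes = {c for _, c in dval}
    return changes == {"id"} and (WILD in dsymbols or symbols <= dsymbols)

def total_order(atoms, prec):
    """(3.45): every two distinct atoms are related by the order."""
    return all(x == y or (x, y) in prec or (y, x) in prec
               for x in atoms for y in atoms)

def unchanged(atoms, prec):
    """(3.44): totally ordered, no new atoms, and no perturbed atom is changed."""
    if not total_order(atoms, prec):
        return False
    if any(len(a) == 3 for a in atoms):  # a 3-tuple encodes a new ms-atom
        return False
    return all(unchanged_val(val, dval) and unchanged_val(mul, dmul)
               for (mul, val, dmul, dval, _tag) in atoms)
```

A perturbation `{("Y", "id")}` applied to a value represented by X is rejected by `unchanged_val`, since the symbolic part forces the perturbed value to be the one represented by Y.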
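Definitions of $\mathit{Src}_{FC}$ and $\mathit{Act}_{FC}$. Source S ignores its inputs and doesn't fail, so the
input-output function for it is nearly the same as (2.50); the only modification to the source's
outputs is that they now contain id, indicating that they are unchanged:

\[
\begin{aligned}
\mathit{Src}_{FC}(\mathit{dests}) = (&\lambda \mathit{fail} : \{\mathit{OK}\}.\ (\lambda h : \mathit{Hist}_{FC}.\ (\lambda x : \mathit{Name}. \\
&\mathbf{if}\ x \in \mathit{dests}\ \mathbf{then}\ (\{(1, X{:}N, \_{:}\mathit{id}, \_{:}\mathit{id}, 0)\}, \emptyset) \\
&\mathbf{else}\ (\emptyset, \emptyset)))).
\end{aligned}
\tag{3.47}
\]

Actuator A produces no outputs, so the input-output function for it is simply

\[ \mathit{Act}_{FC} = (\lambda \mathit{fail} : \{\mathit{OK}\}.\ (\lambda h : \mathit{Hist}_{FC}.\ (\lambda x : \mathit{Name}.\ (\emptyset, \emptyset)))). \tag{3.48} \]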

Definition of $\mathit{Comp}_{FC}$. Processors $F_1$-$F_3$ and $G_1$-$G_3$ are represented by appropriate
instances of $\mathit{Comp}_{FC}$. Informally, for $\mathit{src}, \mathit{dest} \in \mathit{Name}$ and $\mathit{op} \in \mathit{Sym}$, $\mathit{Comp}_{FC}(\mathit{src}, \mathit{dest}, \mathit{op})$
normally applies operator op to each input from src and sends the results to dest. If a "value
failure" occurs (i.e., fail = valFail), then the perturbations to those outputs are $\top_{\Delta V}$.

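The application of an operator to a perturbed or new value is handled by the function
$\mathit{apOp}_F$. A "perturbed or new value" is represented by an element of $(\mathit{Val} \times \Delta\mathit{Val}) \cup \mathit{Val}$;
elements of $\mathit{Val} \times \Delta\mathit{Val}$ correspond to values in perturbed ms-atoms, and elements of Val
to values in new ms-atoms. For $\mathit{op} \in \mathit{Sym}$, $\mathit{aval} \in \mathit{AVal}$, and $x \in \mathit{Val} \cup (\mathit{Val} \times \Delta\mathit{Val})$,
$\mathit{apOp}_F(\mathit{OK})(\mathit{op}, \mathit{aval})(x)$ is the result of applying the operator op to each symbolic value in
x, provided the associated abstract value is aval; as in definition (2.53) of apOp, aval can be
thought of as representing the domain and range of op. Formally, $\mathit{apOp}_F(\mathit{OK})$ is defined by

\[
\begin{array}{l}
\mathit{apOp}_F(\mathit{OK})(\mathit{op}, \mathit{aval})(x) = \\
\quad \mathbf{match}\ x\ \mathbf{with} \\
\quad |\ (\mathit{val}, \delta\mathit{val}) \to (\mathit{apOp}(\mathit{op}, \mathit{aval})(\mathit{val}),\ \mathit{apOp}_{\Delta}(\mathit{op}, \mathit{aval})(\mathit{val}, \delta\mathit{val})) \\
\quad |\ \mathit{val} \to \mathit{apOp}(\mathit{op}, \mathit{aval})(\mathit{val})
\end{array}
\tag{3.49}
\]

where $\mathit{apOp}(\mathit{op}, \mathit{aval})$ is defined in (2.53), and $\mathit{apOp}_{\Delta}$ is defined similarly by

\[
\begin{array}{l}
\mathit{apOp}_{\Delta}(\mathit{op}, \mathit{aval})(\mathit{val}, \delta\mathit{val}) = \displaystyle\bigcup_{(s, \delta a) \in \delta\mathit{val}} \{\mathbf{if}\ \bar{\pi}_2(\mathit{val}) = \{\mathit{aval}\} \wedge \delta a = \mathit{id}\ \mathbf{then} \\
\qquad\qquad \mathbf{if}\ s = \_\ \mathbf{then}\ (\_, \mathit{id})\ \mathbf{else}\ (\mathit{op}(s), \mathit{id}) \\
\qquad \mathbf{else}\ (\_, \top_{\Delta V})\}.
\end{array}
\tag{3.50}
\]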

Behavior of processors that suffer value failures is described using $\mathit{apOp}_F(\mathit{valFail})$ rather
than $\mathit{apOp}_F(\mathit{OK})$. For $\mathit{op} \in \mathit{Sym}$, $\mathit{aval} \in \mathit{AVal}$, and $x \in \mathit{Val} \cup (\mathit{Val} \times \Delta\mathit{Val})$,
$\mathit{apOp}_F(\mathit{valFail})(\mathit{op}, \mathit{aval})(x)$
is the result of applying the operator op to each symbolic value in the "original part" of x
(i.e., the first component of a perturbed value, and no part of a new value) to obtain the
original part of the output. The perturbation or new value in the output is simply "top",
representing arbitrary values. Formally,

\[
\begin{array}{l}
\mathit{apOp}_F(\mathit{valFail})(\mathit{op}, \mathit{aval})(x) = \mathbf{match}\ x\ \mathbf{with} \\
\quad |\ (\mathit{val}, \delta\mathit{val}) \to (\mathit{apOp}(\mathit{op}, \mathit{aval})(\mathit{val}),\ \{(\_, \top_{\Delta V})\}) \\
\quad |\ \mathit{val} \to \{(\_, \top_V)\}.
\end{array}
\tag{3.51}
\]

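Given the definition of $\mathit{apOp}_F$, the definition of $\mathit{Comp}_{FC}$ is simple:

\[
\begin{aligned}
\mathit{Comp}_{FC}(\mathit{src}, \mathit{dest}, \mathit{op}) = (&\lambda \mathit{fail} : \{\mathit{OK}, \mathit{valFail}\}.\ (\lambda h : \mathit{Hist}_{FC}.\ (\lambda x : \mathit{Name}. \\
&\mathbf{if}\ x = \mathit{dest}\ \mathbf{then}\ \overline{\mathit{apOp}_F}(\mathit{fail})(\mathit{op}, N)(h(\mathit{src})) \\
&\mathbf{else}\ (\emptyset, \emptyset)))),
\end{aligned}
\tag{3.52}
\]

where $\overline{\mathit{apOp}_F}(\mathit{fail})(\mathit{op}, \mathit{aval})$ is the extension of $\mathit{apOp}_F(\mathit{fail})(\mathit{op}, \mathit{aval})$ to posets of ms-atoms:
the extension is done by letting $\mathit{apOp}_F$ operate on the value and perturbation in each perturbed
ms-atom and on the value in each new ms-atom. Retagging may be necessary in the
extension to avoid "collisions".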
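
The case analysis in (3.49) to (3.51) can be sketched as follows. Several parts of this sketch are assumptions rather than the text's definitions: values and perturbations are frozensets of `(symbolic, abstract)` pairs, a perturbed value is a Python tuple `(val, dval)`, symbolic application builds a string such as `"f(X)"`, and `ap_op` is only a simplified stand-in for apOp of (2.53), which is defined outside this section.

```python
WILD, ID, TOP_DV, TOP_V = "_", "id", "top_dV", "top_V"  # assumed encodings

def ap_op(op, aval, val):
    """Simplified stand-in for apOp (2.53): apply op symbolically where the type fits."""
    out = set()
    for s, a in val:
        if a == aval:
            out.add((WILD if s == WILD else f"{op}({s})", aval))
        else:
            out.add((WILD, TOP_V))
    return frozenset(out)

def ap_op_delta(op, aval, val, dval):
    """(3.50): keep the identity change only for well-typed, unchanged inputs."""
    well_typed = {a for _, a in val} == {aval}
    out = set()
    for s, da in dval:
        if well_typed and da == ID:
            out.add((WILD, ID) if s == WILD else (f"{op}({s})", ID))
        else:
            out.add((WILD, TOP_DV))
    return frozenset(out)

def ap_op_f(fail, op, aval, x):
    """(3.49) for fail == "OK", (3.51) for a value failure."""
    if isinstance(x, tuple):  # perturbed value (val, dval)
        val, dval = x
        if fail == "OK":
            return (ap_op(op, aval, val), ap_op_delta(op, aval, val, dval))
        return (ap_op(op, aval, val), frozenset({(WILD, TOP_DV)}))
    if fail == "OK":  # new value
        return ap_op(op, aval, x)
    return frozenset({(WILD, TOP_V)})
```

Note that under a value failure the original part of the output is still computed from the original part of the input, which is what origIndep requires.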

Definition of $\mathit{Voter}_{FC}$. The voter is represented by an input-output function that tallies
the original ballots (i.e., ignoring changes and new ms-atoms) to obtain an original result
$t_o$, then tallies the changed and new ballots to obtain a perturbed result $t_p$, then compares
these two results to determine the perturbation in its output. The function tally defined in
(2.56) is used in both cases to tally the ballots. Recall that the first argument of tally is a
function used to extract ballots from a poset of ms-atoms. Ballots based on the original parts
of the input ms-atoms are extracted with the function $\mathit{ballot}_o = \mathit{ballot} \circ \mathit{orig}$, where ballot
is defined in (2.55) and orig is defined following (3.42). Ballots reflecting perturbations and
new ms-atoms in the input are extracted with the function $\mathit{ballot}_p$ defined in Figure 3.8. As
in definition (2.55) of ballot, we approximate rather than accumulate sets of possibilities in
the definition of $\mathit{ballot}_p$.

The definition of $\mathit{Voter}_{FC}$ appears in Figure 3.9. If the input history contains no perturbed
ms-atom from some component in srcs, then there is no point in tallying the original
ballots, so we simply let $t_o$ equal $\bot$. Similarly, if the input history contains no perturbed
or new ms-atom from some component in srcs, then there is no point in tallying the
perturbed/new ballots, so we simply let $t_p$ equal $\bot$. It follows from these comments (and the
fact that tally never returns $\bot$) that if $t_p = \bot$, then $t_o = \bot$; this justifies the comment "this
can't happen" in the definition of $\mathit{Voter}_{FC}$.
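\[
\begin{array}{l}
\mathit{ballot}_p((S, \prec_S)) = \mathbf{match}\ S\ \mathbf{with} \\
\quad |\ \{(\mathit{mul}, \mathit{val}, \delta\mathit{mul}, \delta\mathit{val}, \mathit{tag})\} \to \\
\qquad \mathbf{let}\ \mathit{mul}' = \mathbf{if}\ \bar{\pi}_2(\delta\mathit{mul}) = \{\mathit{id}\}\ \mathbf{then}\ \mathit{mul}\ \mathbf{else}\ \{(\_, *)\} \\
\qquad \mathbf{in}\ \mathbf{match}\ \delta\mathit{val}\ \mathbf{with} \\
\qquad\quad |\ \{s{:}\delta a\} \to \mathbf{if}\ \bar{\pi}_2(\mathit{val}) = \{\mathit{aval}\} \wedge \delta a = \mathit{id}\ \mathbf{then}\ (\mathit{mul}', s{:}\mathit{aval}) \\
\qquad\qquad\qquad\qquad \mathbf{else}\ (\mathit{mul}', s{:}\top_V) \\
\qquad\quad |\ \_ \to (*\ \text{approximate}\ *)\ (\mathit{mul}', \top_V) \\
\quad |\ \{(\mathit{mul}, \mathit{val}, \mathit{tag})\} \to \mathbf{match}\ \mathit{val}\ \mathbf{with} \\
\qquad\quad |\ \{s{:}a\} \to (\mathit{mul}, s{:}a) \\
\qquad\quad |\ \_ \to (*\ \text{approximate}\ *)\ (\mathit{mul}, \top_V) \\
\quad |\ \_ \to (*\ \text{approximate}\ *)
\end{array}
\]

Figure 3.8: Definition of $\mathit{ballot}_p$.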
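\[
\begin{array}{l}
\mathit{Voter}_{FC}(\mathit{srcs}, \mathit{dest}, \mathit{aval}) = \\
\quad (\lambda \mathit{fail} : \{\mathit{OK}\}.\ (\lambda h : \mathit{Hist}_{FC}.\ (\lambda x : \mathit{Name}. \\
\qquad \mathbf{if}\ x \neq \mathit{dest}\ \mathbf{then}\ (\emptyset, \emptyset) \\
\qquad \mathbf{else}\ \mathbf{let}\ t_o = \mathbf{if}\ (\exists \mathit{src} \in \mathit{srcs} : (\pi_1(h(\mathit{src})) \cap L_{per}) = \emptyset)\ \mathbf{then}\ \bot \\
\qquad\qquad\qquad\qquad \mathbf{else}\ \mathit{tally}(\mathit{ballot}_o, \mathit{srcs}, \mathit{aval}, h) \\
\qquad \mathbf{in}\ \mathbf{let}\ t_p = \mathbf{if}\ (\exists \mathit{src} \in \mathit{srcs} : \pi_1(h(\mathit{src})) = \emptyset)\ \mathbf{then}\ \bot \\
\qquad\qquad\qquad\qquad \mathbf{else}\ \mathit{tally}(\mathit{ballot}_p, \mathit{srcs}, \mathit{aval}, h) \\
\qquad \mathbf{in}\ \mathbf{match}\ (t_o, t_p)\ \mathbf{with} \\
\qquad\quad |\ (\bot, \bot) \to (\emptyset, \emptyset) \\
\qquad\quad |\ ((\mathit{mul}, \mathit{val}), \bot) \to (*\ \text{this can't happen}\ *)\ (\emptyset, \emptyset) \\
\qquad\quad |\ (\bot, (\mathit{mul}, \mathit{val})) \to (\{(\mathit{mul}, \{\mathit{val}\}, 0)\}, \emptyset) \\
\qquad\quad |\ ((\mathit{mul}, \mathit{val}), (\mathit{mul}', \mathit{val}')) \to \\
\qquad\qquad \mathbf{let}\ \delta\mathit{val} = \mathbf{if}\ \pi_1(\mathit{val}) = \pi_1(\mathit{val}') \wedge \pi_1(\mathit{val}) \neq \_\ \mathbf{then}\ (\pi_1(\mathit{val}), \mathit{id}) \\
\qquad\qquad\qquad\qquad\quad \mathbf{else}\ (\_, \top_{\Delta V}) \\
\qquad\qquad \mathbf{in}\ \mathbf{let}\ \delta\mathit{mul} = \mathbf{if}\ \mathit{definite}(\mathit{mul}) \wedge \mathit{definite}(\mathit{mul}')\ \mathbf{then}\ (\_, \mathit{id})\ \mathbf{else} \\
\qquad\qquad \mathbf{in}\ (\{(\mathit{mul}, \{\mathit{val}\}, \{\delta\mathit{mul}\}, \{\delta\mathit{val}\}, 0)\}, \emptyset)))),
\end{array}
\tag{3.53}
\]

Figure 3.9: Definition of $\mathit{Voter}_{FC}$.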
3.5.2  Semantics

Semantics of Posets of ms-atoms. The definition of $[\![\cdot]\!]_{POSet(L_{FC})}$ is a straightforward extension of definition (2.65) of $[\![\cdot]\!]_{POSet(L)}$. For $\rho \in interp(Sym)$,

1(5,  -<)iposet$^FC)  =  Seq(CVal)  x  Seq(CVal)  |  (3.54) 

(3 g  G  dom(cr)  S  fl  Lper  :  (3 g'  G  dom(a')  S  : 
compatpPOSet(LFc)(S ,  -<,  a,  a',  3,  #')}, 

where the correspondences $g$ and $g'$ must satisfy conditions related to the original part of $S$, the new part of $S$, the perturbations in $S$, and the ordering $\prec$; these four conditions are formalized as the four conjuncts, respectively, in

compatpPOSet(LFc)(S ,  a,  o',  g ,  g')  =  (3.55) 

A  compatpPOSet(L)(orig(S ),  ^onj(S)(A  n(Tpe?.  x  Lper)) ,  (j:  borig  (S)°g) 

A  compatpPOSet(L)(S  n  Lnewi  A  6l(-^'new  X  &  i  9  ) 

A  (W  G  S  n  iper.  :  A  (Vi  G  :  (Vi7  G  0,MM'(O  :  compa^v-al(7r4(£),  cr[i],  crr [?;'] ) ) ) 

A  compatpAVal(n3(e),  \ginv(£)\ ,  |/‘m,(0|)) 

A  (V(^,4)  GA:  ^-((d)  ^(N)  i/""'(l2)); 

where $b_{orig(S)}$ is defined by (3.43) and $\hat{b}_{orig(S)}$ is the pointwise extension of $b_{orig(S)}$ from $S \cap L_{per}$ to $Order(S \cap L_{per})$, and the predicate used in the third conjunct to check that the perturbations in each ms-atom relate the original values in $\sigma$ to the perturbed values in $\sigma'$ is

compatpAVal(Sval ,  cv,  cv')  =  (3(s,<5a)  G  Sval  :  A  (cv,  cv ')  G  [^o-]Auva;  (3.56) 

A  s  =  _  V  cv'  =  p(s)), 

where J> is defined by (2.61). It is easy to check that $[\![\cdot]\!]_{POSet(L_{FC})}$ is independent of tags.

Semantics  of  Histories.  The  meaning  of  histories  is  a  straightforward  extension  of  the 
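meaning of posets of ms-atoms. For $\rho \in interp(Sym)$,

$$[\![h]\!]^{\rho}_{Hist_{FC}} = \{(ch, ch') \in CHist \times CHist \mid (\forall x \in Name : (ch(x), ch'(x)) \in [\![h(x)]\!]^{\rho}_{POSet(L_{FC})})\}. \qquad (3.57)$$

Note that $[\![h]\!]^{\rho}_{Hist_{FC}}$ is monotonic in $\rho$, i.e.,

$$(\forall \rho_1, \rho_2 \in interp(Sym) : (\forall h \in Hist_{FC} : \rho_1 \le_{interp} \rho_2 \Rightarrow [\![h]\!]^{\rho_1}_{Hist_{FC}} \subseteq [\![h]\!]^{\rho_2}_{Hist_{FC}})). \qquad (3.58)$$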
Semantics of Input-Output Functions. By analogy with definition (2.68) of $\sqsubseteq_{IOF}$, we define for $p \in Process_{FC}$, $f \in IOF_{FC}$, $\rho_c \in Interp(Con)$, and $lvar \subseteq Var$,

P  l=£?"  /  =  (3-59) 

A  dom(p )  =  dom(f) 

A  (V/ai/  G  dom{p)  :  (V((dp,  *r),  (dp7,  ir'))  G  p(fail )  : 

(3g  G  ( CHist  x  CHist)  — interp(lvar)  :  (\/pe  G  Interp(Var  \  Ivar )  : 

(Vm  G  Histpc  :  (Vpe/i  £  ir  x  ir'  :  pe/i  G 

=>  (dpMpch)),  dp\Tr2(pch)))  G  [/(/a*/)(m)]^;^(pcA) )))))). 

Semantics of Systems. For $nf \in Name \to Process_{FC}$ and $\rho_a \in interp(Con)$, the semantics of the abstract system $(nf, \rho_a)$ is given by $\sqsubseteq_{Sys_{FC}}$, whose definition is the same as definition (2.69) of $\sqsubseteq_{Sys}$, except with $\sqsubseteq_{IOF}$ replaced with $\sqsubseteq_{IOF_{FC}}$.

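Semantics of Runs. By analogy with definition (2.71) of $[\![\cdot]\!]_{Run}$, we define for $\rho_a \in interp(Con)$,

$$[\![r]\!]^{\rho_a}_{Run_{FC}} = \{(cr_1, cr_2) \in CRun \times CRun \mid (\exists \rho_c \in Interp(Con, \rho_a) : (\exists \rho_v \in Interp(Var) : (\forall x \in Name : (cr_1(x), cr_2(x)) \in [\![r(x)]\!]^{\rho_c \cup \rho_v}_{Hist_{FC}})))\}. \qquad (3.60)$$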

Semantics of unchanged. Recall that for $h \in Hist_{FC}$, $unchanged(h)$ means that the concrete behavior represented by $h$ does not change, i.e., is the same in the failure-free and faulty executions. This idea is formalized by the theorem

(Vh  G  Histpc  '■  (V/9  G  interp(Sym)  : 

unchanged(h )  >  \hfmstFo  =  (s)jp^ {(m a)}). 

As  mentioned  in  Section  3.1.3,  unchanged  says  nothing  about  the  relative  order  of  messages 
on  different  channels;  indeed,  it  couldn’t  possibly  say  anything  about  inter-channel  orderings, 
since  Histpc  and  CHist  do  not. 

3.5.3  Soundness 

Soundness  plays  the  same  role  for  the  perturbational  framework  as  for  the  framework  of 
Chapter  2  (see  the  comments  in  the  beginning  of  Section  2.4).  The  development  here  is 
closely  analogous  to  the  development  in  Section  2.4.2. 
For convenience, we again consider only finite runs. Let $cruns^{fin}_{FC}(np)(fs)$ contain the pairs of finite runs in $cruns_{FC}(np)(fs)$. Soundness is established by the following theorem, whose proof is almost identical to the proof of Theorem 2.2.

Theorem 3.1. For all $np \in Name \to Process_{FC}$, all $nf \in Name \to IOF_{FC}$, all $\rho_a \in interp(Con)$, all $fs \in FS(np)$, and all $i_{fp} \in \mathbb{N}$, if $np \sqsubseteq_{Sys_{FC}} (nf, \rho_a)$, and if $r = stepF(nf, fs)^{i_{fp}}(\bot_{Run})$ is a fixed-point of $stepF(nf, fs)$, then $cruns^{fin}_{FC}(np)(fs) \subseteq [\![r]\!]^{\rho_a}_{Run_{FC}}$.

Proof: Let $\rho_c \in Interp(Con, \rho_a)$ witness the existential quantification in $np \sqsubseteq_{Sys_{FC}} (nf, \rho_a)$. Consider any $fs \in FS(np)$. The definition of $\sqsubseteq_{IOF_{FC}}$ ensures $fs \in FS(nf)$. Consider any $pcr_0 \in cruns^{fin}_{FC}(np)(fs)$. By definition (3.21) of $cruns_{FC}$, there exists $h \in Name \to IRProcess$ such that

$$\begin{array}{l}
\wedge\ (\forall x \in Name : h(x) \in np(x)(fs(x))) \\
\wedge\ (\forall a \in \{1, 2\} : \pi_a(pcr_0) \in cruns(\lambda x : Name.\ \{\pi_a(h(x))\})).
\end{array}$$


Let

$$pcr[i] = (step(\pi_1 \circ \pi_1 \circ h)^i(\bot_{CRun}),\ step(\pi_1 \circ \pi_2 \circ h)^i(\bot_{CRun}))$$

and $r[i] = stepF(nf, fs)^i(\bot_{Run})$. We show by induction that

$$(\forall i \in \mathbb{N} : (\forall x \in Name : pcr[i](x) \in [\![r[i](x)]\!]^{\rho_c \cup \rho_v[i]}_{Hist_{FC}})), \qquad (3.62)$$

where $pcr[i](x)$ denotes the pointwise application of $pcr[i]$ to $x$, and

$$\rho_v[i] = \bigcup_{x \in Name} g(x)(pcr[i](x)),$$

where for all $x$, $g(x) \in CHist \times CHist \to interp(Var(x))$ is a witness for the existential quantification in $np(x) \sqsubseteq^{\rho_c, Var(x)}_{IOF_{FC}} nf(x)$ when the universal quantification over $dom(p)$ is instantiated with $fs(x)$ and the universal quantification over $p(fail)$ is instantiated with $h(x)$.


Base  Case.  For  i  =  0,  the  claim  is  that  (V.x  G  Name  :  _L cRun(x)  G  ^Run(x)rn-T’l%  which 
follows  easily  from  the  definitions. 

Step  Case.  Using  the  induction  hypothesis  and  the  definition  (3.59)  of  C.joF^x\  then 
simplifying  using  the  definition  (2.6)  of  step ,  we  get 

(V.x  G  Name  :  pcr[i  +  l](x)  G  lnf(x)(fs(x))(r[i](x))]pI^s^]). 


(3.63) 
Monotonicity of all the $\pi_1(\pi_1(h(x)))$ and all the $\pi_1(\pi_2(h(x)))$ implies

$$\pi_1(pcr[i]) \le_{CRun} \pi_1(pcr[i+1]) \wedge \pi_2(pcr[i]) \le_{CRun} \pi_2(pcr[i+1]).$$

Monotonicity of all the $g(x)$ then implies $\rho_v[i] \le_{interp} \rho_v[i+1]$. So, by monotonicity of $[\![\cdot]\!]^{\rho}_{Hist_{FC}}$ in $\rho$ (from (3.58)), (3.63) still holds if $\rho_v[i]$ is replaced with $\rho_v[i+1]$. From the resulting equation and definition (3.6) of $stepF$, we get $(\forall x \in Name : pcr[i+1](x) \in [\![r[i+1](x)]\!]^{\rho_c \cup \rho_v[i+1]}_{Hist_{FC}})$. This completes the proof of (3.62).

Finally, we show that (3.62) implies $pcr_0 \in [\![r]\!]^{\rho_a}_{Run_{FC}}$. Since both runs in $pcr_0$ are finite, there exists $i_0 \in \mathbb{N}$ such that $(\forall i \ge i_0 : pcr_0 = pcr[i])$. The desired result is obtained by instantiating the universal quantification in (3.62) with $i = \max(i_{fp}, i_0)$. ∎

3.5.4  Termination  of  Fixed-Point  Calculations 

The development here is closely analogous to the development in Section 2.5.

Orderings. The orderings are defined in exactly the same way as in Chapter 2, except that they are built on an ordering $\preceq_{Set(Seq^2)}$ on sets of pairs of sequences, instead of an ordering $\preceq_{Set(Seq)}$ on sets of sequences. Thus, the pre-order on posets of ms-atoms is

Si  —*pctSet( ipc)  S2  =  G  Interp(Var)  :  (3pt  G  Interp(lvar)  :  (3.64) 

ire  IPctJpv  J  IT  o  jPcNPv'StPl)  \\ 

PlIpOS’e^Lfr)  —Set(Seq2)  llD2  \  POSet{LFC)>  >  ■ 

where $\preceq_{Set(Seq^2)}$ is defined by

$$S_1 \preceq_{Set(Seq^2)} S_2 = (\forall x \in S_1 : (\exists x' \in S_2 : \pi_1(x) \le_{Seq} \pi_1(x') \wedge \pi_2(x) \le_{Seq} \pi_2(x'))). \qquad (3.65)$$
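The ordering (3.65) can be checked mechanically on finite sets. The following Python sketch is illustrative only. It assumes that $\le_{Seq}$ is the prefix ordering on finite sequences, which is an assumption made here rather than something restated in this section, and the names `is_prefix` and `leq_set_seq2` are hypothetical.

```python
def is_prefix(s, t):
    """Assumed meaning of <=_Seq: sequence s is a prefix of sequence t."""
    return len(s) <= len(t) and tuple(t[:len(s)]) == tuple(s)


def leq_set_seq2(s1, s2):
    """Check (3.65): every pair in s1 is dominated, component by component,
    by some pair in s2."""
    return all(
        any(is_prefix(x[0], y[0]) and is_prefix(x[1], y[1]) for y in s2)
        for x in s1
    )
```

Note that the same pair in the larger set must dominate both components; dominating each component with a different pair is not enough.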

The definitions of $\le_{POSet(L_{FC})}$, $\le_{InHist_{FC}}$, $\le_{OutHist_{FC}}$, and $\le_{Run_{FC}}$ are exactly analogous to (2.79), (2.80), (2.81), and (2.83), respectively, with the obvious substitutions: replace $L$ with $L_{FC}$, $Hist$ with $Hist_{FC}$, etc. The definition of monotonicity for elements of $IOF_{FC}$ is obtained from the definition (2.82) of monotonicity for $IOF$ by quantifying over failures and applying the above substitutions; thus $f \in IOF_{FC}$ is monotonic with respect to $lvar \subseteq Var$ and $\rho_a \in interp(Con)$ iff

(Y fail  G  dom(f)  :  (Vpc  G  Interp(Con,  pa)  :  (V/q  G  HistFC  :  (V/q  G  HistFC  : 
hi  <%Hi*FO  h2  =>•  f{faU)(h i)  <^ZtFC  f{fail)(h2))))). 
Monotonicity of stepF. As in Section 2.5.1, monotonicity of the "step" function follows from monotonicity of the input-output functions. This is expressed by the following theorem.

Theorem 3.2. For all $nf \in Name \to IOF_{FC}$ and all $\rho_a \in interp(Con)$, if for all $x \in Name$, $nf(x)$ is monotonic with respect to $Var(x)$ and $\rho_a$, then for all $fs \in FS(nf)$, $stepF(nf, fs)$ is monotonic with respect to $\le_{Run_{FC}}$.

The  proof  is  almost  identical  to  the  proof  of  Theorem  2.4. 


The First Step. As in Section 2.5.2, a conjunct must be added to the definition of input-output functions to ensure that $\bot_{Run} \le_{Run_{FC}} stepF(nf, fs)(\bot_{Run})$. By analogy with (2.84), the conjunct that must be added to definition (3.59) of $p \sqsubseteq_{IOF_{FC}} f$ is

$$(\forall fail \in dom(p) : (\forall y \in Name : noInitialOut_{FC}(p(fail), y) \Rightarrow f(fail)(\bot_{Hist})(y) = (\emptyset, \emptyset))), \qquad (3.67)$$

where

$$noInitialOut_{FC}(p, y) = (\forall ((dp, ir), (dp', ir')) \in p : \bot_{CHist} \in ir \wedge \bot_{CHist} \in ir' \Rightarrow dp(\bot_{CHist})(y) = \varepsilon \wedge dp'(\bot_{CHist})(y) = \varepsilon). \qquad (3.68)$$

This  suffices  to  establish  the  following  theorem. 

Theorem 3.3. For all $np \in Name \to Process_{FC}$, all $nf \in Name \to IOF_{FC}$, and all $\rho_a \in interp(Con)$, if $np \sqsubseteq_{Sys_{FC}} (nf, \rho_a)$, then for all $fs \in FS(nf)$, $\bot_{Run} \le_{Run_{FC}} stepF(nf, fs)(\bot_{Run})$.

The  proof  is  similar  to  the  proof  of  Theorem  2.5. 


Finite Ascending Chains. The comments in Section 2.5.3 apply here as well. An analogue of $FAC_n$ for the perturbational framework is obtained by adding the requirement that $\Delta AVal$ have size at most $n$.


Chapter  4 


Two  Classic  Problems  in 
Fault-Tolerance 


We  have  presented  two  analysis  frameworks:  the  non-perturbational  framework  in  Section 
3.1,  and  the  perturbational  framework  in  Sections  3.4  and  3.5.  To  illustrate  the  use  of  these 
two  frameworks,  we  apply  each  to  one  classic  problem  in  fault-tolerance. 

The  non-perturbational  framework  is  applied  to  a  protocol  for  reliable  broadcast  that 
tolerates  patterns  of  crash  failures  that  don’t  partition  the  network  [HT94,  section  6].  This 
example  demonstrates  the  power  of  symbolic  multiplicities  for  efficient  analysis  of  systems 
subject to crash failures. Showing that the protocol satisfies the basic requirements of validity, agreement and integrity [HT94, section 3] is straightforward, because these properties
depend  mainly  on  equalities  between  multiplicities.  Showing  that  the  protocol  also  provides 
FIFO  message  delivery  requires  analyzing  inequalities  between  multiplicities.  Invariants  are 
useful  for  this. 

Next, the perturbational framework is applied to a protocol for Byzantine Agreement. A seminal paper by Lamport, Shostak, and Pease defines the problem of Byzantine Agreement and presents two solutions [LSP82]. We analyze the first of those, namely, the Oral Messages algorithm.¹ We use the perturbational framework for this problem, because the correctness requirements are easily expressed in terms of acceptable changes.

One motivation for analyzing a Byzantine Agreement algorithm that has already been proved correct [LM94] is to allow that algorithm to be used as a benchmark for comparison of different verification methods. Also, analysis of a Byzantine Agreement algorithm could provide a starting point for analysis of more complicated systems, such as digital flight control systems [DBC91], in which a Byzantine Agreement algorithm is only one of the fault-tolerance mechanisms.

¹ Their second solution, the Signed Messages algorithm, requires digital signatures and can only be analyzed using the techniques presented in Chapter 5.

4.1  Reliable  Broadcast 

Section 4.1.1 introduces a reliable broadcast protocol and its specification; both are adopted from [HT94]. Section 4.1.2 discusses modeling of crash failures for this protocol and gives an
analysis  of  relationships  between  multiplicities.  This  discussion  motivates  the  fault-tolerance 
requirement  in  Section  4.1.3  and  the  definitions  of  the  input-output  functions  in  Section  4.1.4. 

4.1.1  Reliable  Broadcast  Protocol 

Consider a system with clients $C_1, \ldots, C_n$ and corresponding servers $S_1, \ldots, S_n$. A function $nbrs \in Name \to Set(Name)$ describes the connectivity of the network. We assume each client can communicate directly only with the corresponding server, so $nbrs(C_i) = \{S_i\}$. A server can communicate directly with its client and some of the other servers, so $nbrs(S_i)$ satisfies $\{C_i\} \subseteq nbrs(S_i) \subseteq \{C_i\} \cup (\{S_1, \ldots, S_n\} \setminus \{S_i\})$.

Informally, the reliable broadcast protocol in [HT94, section 6.3] is as follows. A client $C_i$ initiates a broadcast of a message by sending the message to its server $S_i$. When a server receives a message, it checks whether it has received that message before. If so, it ignores the message; if not, it sends the message to all of its neighbors. When a client receives a message from its server, we say it delivers that message.
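The informal protocol above can be sketched as a small simulation. This is a minimal illustration, not the formal model used in the analysis: the function name `reliable_broadcast`, the convention that names starting with "C" are clients, and the simplification that a crashed server is silent from the very start are all assumptions made here.

```python
from collections import deque


def reliable_broadcast(nbrs, sender, msg, crashed=frozenset()):
    """Flood msg from client `sender`; return the messages each client delivers.

    Assumptions of this sketch: component names starting with "C" are clients,
    all other names are servers, and a crashed server never receives or relays.
    """
    delivered = {x: [] for x in nbrs if x.startswith("C")}
    seen = {x: set() for x in nbrs if not x.startswith("C")}
    # The client initiates the broadcast by sending msg to its server.
    queue = deque((y, msg) for y in nbrs[sender])
    while queue:
        dst, m = queue.popleft()
        if dst in crashed:
            continue  # a crashed server neither records nor relays
        if dst in delivered:
            delivered[dst].append(m)  # a client delivers what its server sends
        elif m not in seen[dst]:
            seen[dst].add(m)  # first receipt: relay to all neighbors
            queue.extend((y, m) for y in nbrs[dst])
    return delivered
```

With three fully connected servers, every client delivers the message exactly once, because each server relays only on first receipt.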

Following  [HT94,  section  3.1],  we  assume: 

Known-Sender:  Each  message  contains  the  name  of  its  sender. 

Uniqueness:  Clients  send  each  message  at  most  once.  This  is  easily  implemented 
by  including  a  unique  identifier,  such  as  a  sequence  number  or  timestamp,  in  each 
message. 

The known-sender assumption is captured by having client $x$ send messages with abstract value $MF(x)$ (mnemonic for "message from $x$"). The meaning of $MF(x)$ depends on the message format. For example, if messages are tuples containing the sender's name, a sequence number that is an element of $\mathbb{N}$, and data of type $D$, then

$$[\![MF(x)]\!]_{AVal} = \bigcup_{i \in \mathbb{N},\, d \in D} \{\langle x, i, d \rangle\}. \qquad (4.1)$$
There are two approaches to modeling the second assumption: the unique identifier can be modeled explicitly or implicitly. As an example of explicit modeling, we could make (say) sequence numbers explicit by using abstract values of the form $MF'(x, i)$, where

$$[\![MF'(x, i)]\!]_{AVal} = \bigcup_{d \in D} \{\langle x, i, d \rangle\}.$$

Modeling the unique identifier implicitly is preferable because it yields a more general model: it abstracts from particular schemes for generating unique identifiers. The uniqueness assumption is expressed directly by using a different variable to represent each message and asserting, as an invariant, that the values of these variables are distinct. For $x \in Name$, let $Var_M(x) \subseteq Var(x)$ be the variables used to represent messages broadcast by $x$. The invariant is

$$I_{RB}(x) = \{\rho \in interp(Var(x)) \mid \text{let } S = dom(\rho) \cap Var_M(x) \text{ in } (\forall v_1 \in S : (\forall v_2 \in S \setminus \{v_1\} : \rho(v_1) \ne \rho(v_2)))\}. \qquad (4.2)$$
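As a concrete reading of invariant (4.2), the following Python sketch tests whether an interpretation assigns pairwise distinct values to the message variables. The function name `satisfies_uniqueness`, the dictionary representation of an interpretation, and the requirement that values be hashable are assumptions of this illustration.

```python
def satisfies_uniqueness(rho, var_m):
    """Check (4.2): the variables in dom(rho) that belong to var_m
    have pairwise distinct values.

    rho is a dict from variable names to (hashable) concrete values;
    var_m is the set of variables representing broadcast messages.
    """
    values = [rho[v] for v in rho if v in var_m]
    return len(values) == len(set(values))
```

Variables outside `var_m` are ignored, so they may share values with message variables without violating the invariant.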

For illustration, consider the system with $n = 3$ and with each server having the other two servers as neighbors. Suppose client $C_1$ broadcasts a single message $X : MF(C_1)$, and the other clients broadcast no messages. Assuming the clients and servers run the protocol sketched above, the run in Figure 4.1 represents the failure-free behavior of this system. Since multiplicities play a central role in analysis of reliable broadcast, we do not elide any multiplicities in figures in Section 4.1.

Specification. The defining properties of reliable broadcast are [HT94, section 3]:

Validity: If a client $C_i$ broadcasts a message $m$ and corresponding server $S_i$ is non-faulty, then $C_i$ eventually delivers $m$.

Integrity: For each message $m \in [\![MF(x)]\!]_{AVal}$, each client having a non-faulty server delivers $m$ at most once and only if $m$ was previously broadcast by $x$.

Agreement: If a client having a non-faulty server delivers a message $m$, then all clients of non-faulty servers eventually deliver $m$.

FIFO reliable broadcast must also satisfy

FIFO Order: If a client broadcasts a message $m$ before it broadcasts a message $m'$, then no client of a non-faulty server delivers $m'$ unless it has previously delivered $m$.
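To make the FIFO Order property concrete, the following Python sketch checks it on one client's delivery sequence, given the order in which a single broadcaster sent its messages. The function name `fifo_order_ok` and the list-based representation of concrete traces are assumptions of this illustration and are not part of the framework.

```python
def fifo_order_ok(broadcast_order, delivered):
    """Return True if `delivered` respects FIFO Order with respect to
    `broadcast_order`.

    broadcast_order: messages of one broadcaster, in the order broadcast.
    delivered: messages delivered by one client, in delivery order.
    """
    position = {m: k for k, m in enumerate(broadcast_order)}
    seen = set()
    for m in delivered:
        if m in position:
            # Every message broadcast earlier must already have been delivered.
            earlier = broadcast_order[:position[m]]
            if any(e not in seen for e in earlier):
                return False
        seen.add(m)
    return True
```

Delivering only a prefix of the broadcast messages is allowed; delivering a later message without the earlier one is not.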
Figure 4.1: Failure-free behavior of the reliable broadcast protocol.


These properties are required to hold in failure scenarios in which servers crash, provided the network remains connected. The network is connected in failure scenario $fs$ if for each pair $(x, y)$ of servers that are non-faulty in $fs$, there is a sequence $\sigma \in Seq(Name)$ that starts with $x$, ends with $y$, and satisfies

$$(\forall i \in dom(\sigma) \setminus \{0\} : fs(\sigma[i]) = OK \wedge \sigma[i] \in nbrs(\sigma[i-1])). \qquad (4.3)$$
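The connectivity condition (4.3) amounts to a reachability check through non-faulty components. The Python sketch below is one way to test it; the function name `network_connected`, the dictionary encodings of $nbrs$ and $fs$, and the convention that components missing from `fs` are OK are assumptions of this illustration.

```python
def network_connected(nbrs, fs, servers):
    """Check that every non-faulty server can reach every other non-faulty
    server along a path whose later elements are all OK, as in (4.3)."""

    def ok(x):
        return fs.get(x, "OK") == "OK"

    def reachable(start):
        reached, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for y in nbrs.get(x, ()):
                if ok(y) and y not in reached:
                    reached.add(y)
                    stack.append(y)
        return reached

    live = [s for s in servers if ok(s)]
    # Search from each live server, so no symmetry of nbrs is assumed.
    return all(set(live) <= reachable(x) for x in live)
```

For example, three fully connected servers remain connected when one crashes, whereas a line of three servers is partitioned when the middle one crashes.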

Section  4.1.3  formalizes  this  specification  in  our  framework. 

4.1.2  Relationships  Between  Multiplicities 

When analyzing systems that experience crash failures, there is a spectrum of alternatives for the set $Fail$, corresponding to the inclusion of different amounts of timing information. At one end of the spectrum, the timing of the crash can be abstracted completely: $crash \in Fail$ indicates that the component crashes at an unspecified time during execution. To include timing information, a family $\bigcup_i \{crash_i\}$ of crash failures may be used, where $crash_i$ denotes a crash that occurs at (logical) time $i$. For example, one might take $crash_i$ to denote a crash that occurs after the component sends $i$ messages. Or, if a synchronous system is being modeled using a distinguished value, called "tick", to model the passage of time [Bro90, BD92], one might take $crash_i$ to denote a crash that occurs after $i$ ticks.

Abstracting  from  timing  of  failures  reduces  the  number  of  failure  scenarios  that  need  to  be 
analyzed  and  thereby  makes  the  analysis  more  efficient.  Thus,  if  the  resulting  approximations 
are  not  too  coarse  (and  if  the  system  really  is  fault-tolerant),  then  the  analysis  based  on 
Fail  =  { OK ,  crash}  more  efficiently  establishes  that  the  system  satisfies  its  fault-tolerance 
requirement.  The  analysis  of  reliable  broadcast  in  this  section  illustrates  the  importance  of 
tracking  relationships  between  multiplicities  in  order  to  avoid  false  negatives.  We  start  by 
sketching  an  analysis  that  tracks  equalities  between  multiplicities.  That  analysis  suffices  to 
show  that  the  protocol  provides  reliable  broadcast  but  does  not  show  that  it  provides  FIFO 
reliable  broadcast.  To  show  that  FIFO  delivery  is  provided,  the  analysis  must  also  track 
inequalities  between  multiplicities.  We  describe  two  ways  of  modifying  the  analysis  to  do  so. 

Tracking  Equalities  Between  Multiplicities 

Suppose the input-output function representing a server propagates only abstract multiplicities, using the wildcard for all symbolic multiplicities. The effect of a crash is expressed using an abstract multiplicity of ? instead of 1 in the server's outputs, to reflect the possibility of the server crashing before sending the message. More concretely, consider the system with $n = 3$ described above. Consider the failure scenario $fs_1$ in which only $S_1$ is faulty (i.e., $fs_1(S_1) = crash$). Since $S_1$ might crash at any time, all messages it sends have abstract multiplicity ? instead of 1. The inputs to the non-faulty servers have indefinite multiplicities (i.e., multiplicities not satisfying $definite$, defined by (2.57)), so the outputs of those servers also have indefinite multiplicities. Thus, the result of the analysis is a run just like the one in Figure 4.1, except that on every edge except $(C_1, S_1)$, the multiplicity in the ms-atom is ? rather than 1. This is too coarse an approximation of the system's behavior, because this run has interpretations in which $C_2$ delivers $X$ and $C_3$ does not (and vice versa), and such concrete runs do not satisfy Agreement.
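The loss of precision can be seen by enumerating the concrete multiplicities that independent "?" annotations allow. The Python sketch below assumes that the abstract multiplicity ? stands for "zero or one" and 1 for "exactly one"; the names `CHOICES`, `concretizations`, and `satisfies_agreement` are hypothetical.

```python
from itertools import product

# Assumed reading of abstract multiplicities: "1" is exactly one copy,
# "?" is zero or one copy.
CHOICES = {"1": (1,), "?": (0, 1)}


def concretizations(amuls):
    """Yield every assignment of concrete multiplicities allowed by the
    abstract multiplicities in amuls (a dict from client name to "1" or "?")."""
    names = list(amuls)
    for combo in product(*(CHOICES[amuls[n]] for n in names)):
        yield dict(zip(names, combo))


def satisfies_agreement(deliveries):
    """Agreement for one message: all clients deliver it equally often."""
    return len(set(deliveries.values())) <= 1
```

Because the two "?" annotations are unrelated, the enumeration includes the case where one client delivers the message and the other does not.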

One solution is for the input-output function representing a server to introduce and propagate symbolic multiplicities: if $C_2$ and $C_3$ receive $X$ with the same symbolic multiplicity, then Agreement is ensured. An input-output function $server$ that does this works roughly as follows (see Section 4.1.4 for details). Let $s_{rcv} \in SVal$ be the "symbolic maximum" of the multiplicities with which the server received a message $m$; for example, if a message is received with multiplicity $X{:}?$ from one source and with multiplicity $Y{:}?$ from another, then $s_{rcv}$ is the symbolic value $\max(X, Y)$. If $fail = OK$, $server$ outputs message $m$ with symbolic multiplicity $s_{rcv}$. If $fail = crash$, $server$ introduces for message $m$ a variable $v$ whose value indicates whether that server crashed before outputting that message; $m$ is output with symbolic multiplicity $\min(v, s_{rcv})$, since a server outputs a message if it receives the message and does not crash too soon. Input-output function $server$ uses the following naming scheme for the variables denoted above by $v$: the value (zero or one) of $c.i.x.y \in Var(x)$ indicates whether server $x$ crashes before it can relay to component $y$ the $i$'th message broadcast by client $c$. We use 0-based indexing, so the first message broadcast by a client corresponds to $i = 0$.

We illustrate the analysis for this method of modeling the system by using the same system and same failure scenario as above. Let $nf_{RB}$ be the mapping from $Name$ to $Process_F$ for this system, using the input-output functions in Section 4.1.4. Figure 4.2 shows the run $stepF(nf_{RB}, fs_1)^3(\bot_{Run})$, corresponding to a partial execution of the protocol; in other words, the fixed-point has not yet been reached. We see in this figure that $S_1$ has received message $X$ with multiplicity 1 and sent $X$ to each of its neighbors $y$ with symbolic multiplicity $\min(C_1.0.S_1.y, \max(1))$. Since $C_1.0.S_1.y$ is zero or one, this symbolic multiplicity simplifies to $C_1.0.S_1.y$, as shown in the figure. Servers $S_2$ and $S_3$ have received messages from $S_1$ and forwarded them to their neighbors. In the next step, $S_2$ and $S_3$ would each output $X$ with the maximum of the two symbolic multiplicities with which they received it.² That step yields the fixed-point, which is shown in Figure 4.3. Clients $C_2$ and $C_3$ of the non-faulty servers deliver $X$ with the same symbolic multiplicity, so Agreement is satisfied.
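The simplifications of symbolic multiplicities used here can be confirmed by brute force over all 0/1 interpretations of the variables. The Python sketch below is only an illustration and is not the simplifier used by the analysis; the tuple encoding of expressions and the names `evaluate`, `variables`, and `equivalent` are assumptions.

```python
from itertools import product


def evaluate(expr, rho):
    """Evaluate a symbolic multiplicity under interpretation rho.

    An expression is an int, a variable name (str), or a tuple
    ("max", e1, ...) or ("min", e1, ...).
    """
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return rho[expr]
    op, *args = expr
    values = [evaluate(a, rho) for a in args]
    return max(values) if op == "max" else min(values)


def variables(expr):
    """Collect the variable names occurring in an expression."""
    if isinstance(expr, int):
        return set()
    if isinstance(expr, str):
        return {expr}
    return set().union(*(variables(a) for a in expr[1:]))


def equivalent(e1, e2):
    """True if e1 and e2 agree under every 0/1 interpretation."""
    names = sorted(variables(e1) | variables(e2))
    return all(
        evaluate(e1, dict(zip(names, bits))) == evaluate(e2, dict(zip(names, bits)))
        for bits in product((0, 1), repeat=len(names))
    )
```

For instance, `equivalent(("min", "C1.0.S1.y", ("max", 1)), "C1.0.S1.y")` holds, which matches the simplification stated in the text.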

Tracking  Inequalities  Between  Multiplicities 

The analysis just described suffices to show that the protocol sketched in Section 4.1.1 provides reliable broadcast.³ However, the analysis is too weak to show that the protocol provides FIFO delivery, even though it does. For example, suppose $C_1$ sends two messages $X_0$ and $X_1$, in that order. The input-output function described above for servers handles each broadcast message independently, so the result of the analysis is a run similar to the one in Figure 4.3, except that on each edge, the singleton poset $(\{\ell\}, \emptyset)$ is replaced with the poset $(\{\ell_0, \ell_1\}, \emptyset)$, where $\ell_i$ is $\ell$ with $X$ replaced with $X_i$ and with $C_1.0.S_1.y$ replaced with $C_1.i.S_1.y$. For example, the input to $C_2$ and $C_3$ (from $S_2$ and $S_3$, respectively) is

$$(\{X_0 : MF(C_1)^{\max(C_1.0.S_1.S_2,\, C_1.0.S_1.S_3):?},\ X_1 : MF(C_1)^{\max(C_1.1.S_1.S_2,\, C_1.1.S_1.S_3):?}\},\ \emptyset). \qquad (4.4)$$

² Note that the output of $S_1$ does not change as a result of receiving the messages from $S_2$ and $S_3$, because $\min(C_1.0.S_1.y, \max(1, C_1.0.S_1.S_2, C_1.0.S_1.S_3))$ simplifies to $C_1.0.S_1.y$. This simplification relies on the fact that all of the variables in the former expression represent values in $\{0, 1\}$.

³ Although only the Agreement requirement was discussed, it is easy to see that Validity and Integrity also follow from the results of the analysis.

Figure 4.2: Initial behavior of the reliable broadcast protocol when $S_1$ crashes; more precisely, the run $stepF(nf_{RB}, fs_1)^3(\bot_{Run})$. In the figure, $x.y$ abbreviates $C_1.0.x.y$.

This run satisfies validity, integrity, and agreement, but it represents concrete runs that do not satisfy FIFO Order. For example, for an interpretation $\rho_{so}$ such that

$$\begin{array}{l}
\rho_{so}(C_1.0.S_1.S_2) = 0 \\
\rho_{so}(C_1.0.S_1.S_3) = 0 \\
\rho_{so}(C_1.1.S_1.S_2) = 1,
\end{array}$$

clients $C_2$ and $C_3$ both appear to deliver $X_1$ but not $X_0$.
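The counterexample can be checked by evaluating the symbolic multiplicities of (4.4) under $\rho_{so}$. In the Python sketch below, the names `multiplicity`, `violates_fifo`, and `rho_so` are hypothetical, and the value of $C_1.1.S_1.S_3$, which the text leaves unspecified, is passed as a parameter.

```python
def multiplicity(rho, i):
    """Multiplicity with which C2 and C3 receive message X_i in (4.4)."""
    return max(rho[f"C1.{i}.S1.S2"], rho[f"C1.{i}.S1.S3"])


def violates_fifo(rho):
    """True if X_1 is delivered more often than X_0, i.e. X_1 without X_0."""
    return multiplicity(rho, 1) > multiplicity(rho, 0)


def rho_so(free_value):
    """The interpretation from the text; C1.1.S1.S3 is not constrained there,
    so its value is supplied by the caller."""
    return {
        "C1.0.S1.S2": 0,
        "C1.0.S1.S3": 0,
        "C1.1.S1.S2": 1,
        "C1.1.S1.S3": free_value,
    }
```

Either choice for the unconstrained variable gives multiplicity 0 for $X_0$ and 1 for $X_1$, so the violation does not depend on it.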

The imprecision in this particular analysis stems from an imprecision in modeling crash failures. No constraints are given between the values of variables used in symbolic multiplicities of different messages, so the output ms-atoms of a faulty server represent executions in which that server fails to send an arbitrary subset of its original outputs. In other words, the input-output function sketched above actually represents servers subject to send-omission failures; recall from Section 2.2.3 that send-omission failures cause a component to possibly omit the sending of each message normally produced [HT94, Section 2.3]. Thus, the analysis sketched above shows that the protocol provides reliable broadcast despite send-omission failures. (Crash failures can be regarded as a special case of send-omission failures.)

Figure 4.3: Behavior of the reliable broadcast protocol when $S_1$ crashes, i.e., the run $runF(nf_{RB})(fs_1)$. In the figure, $x.y$ abbreviates $C_1.0.x.y$.

To  establish  that  the  protocol  provides  FIFO  delivery  even  in  the  event  of  crash  failures, 
the  analysis  must  reflect  the  prefix  property  of  crashes:  a  component  that  crashes  sends  only 
a  prefix  of  its  original  outputs  (in  other  words,  fails  to  send  an  arbitrary  suffix  of  its  original 
outputs).  This  implies  that  later  messages  are  sent  with  multiplicities  less  than  or  equal  to 
the  multiplicities  of  earlier  messages.  There  are  two  ways  of  expressing  these  inequalities: 
encode  them  using  combinations  of  max  and  min,  or  express  them  in  an  invariant.  We 
discuss  each  of  these  two  approaches  in  turn. 

Tracking Inequalities using Max and Min. This approach requires changing the meaning of $c.i.x.y$ slightly, so that the value of $\min(\bigcup_{j \le i} \{c.j.x.y\})$ indicates whether server $x$ crashes before it can relay to component $y$ the $i$'th message broadcast by client $c$. To continue the example in which client $C_1$ broadcasts $X_0$ and $X_1$, the analysis based on this approach yields as the input to $C_2$ and $C_3$ the totally-ordered poset (written as a sequence)

$$\langle X_0 : MF(C_1)^{\max(C_1.0.S_1.S_2,\, C_1.0.S_1.S_3):?},\ X_1 : MF(C_1)^{\max(\min(C_1.0.S_1.S_2,\, C_1.1.S_1.S_2),\ \min(C_1.0.S_1.S_3,\, C_1.1.S_1.S_3)):?} \rangle \qquad (4.5)$$

instead of yielding (4.4). The more precise symbolic multiplicities in the input of $S_2$ and $S_3$ have allowed more precise symbolic multiplicities in their outputs and a more precise ordering on the poset. Roughly, these strengthenings are justified by the fact that, for all interpretations of the variables, the multiplicity of $X_1$ is less than or equal to the multiplicity of $X_0$: this fact follows easily from monotonicity of max and from the arithmetic fact $\min(i_0, i_1) \le i_0$. We omit details of this approach, since the approach based on invariants is more elegant and efficient.
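The inequality between the two multiplicities in (4.5) can be confirmed exhaustively for 0/1-valued variables. In the Python sketch below, the parameters `a0`, `a1`, `b0`, `b1` are shorthand introduced here for $C_1.0.S_1.S_2$, $C_1.1.S_1.S_2$, $C_1.0.S_1.S_3$, and $C_1.1.S_1.S_3$, and the function names are hypothetical.

```python
from itertools import product


def mult_x0(a0, a1, b0, b1):
    """Multiplicity of X_0 in (4.5): max over the two relaying neighbors."""
    return max(a0, b0)


def mult_x1(a0, a1, b0, b1):
    """Multiplicity of X_1 in (4.5): each neighbor contributes the min over
    the variables for messages 0 and 1."""
    return max(min(a0, a1), min(b0, b1))


def later_never_exceeds_earlier():
    """Check mult(X_1) <= mult(X_0) under every 0/1 interpretation."""
    return all(
        mult_x1(*bits) <= mult_x0(*bits)
        for bits in product((0, 1), repeat=4)
    )
```

The interpretation that broke FIFO Order under (4.4), with `a0 = b0 = 0` and `a1 = 1`, now gives $X_1$ multiplicity 0 as well.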


Tracking Inequalities using Invariants. This approach retains the original meaning of $c.i.x.y$ and simply asserts that for $i < j$, $\rho(c.j.x.y) \le \rho(c.i.x.y)$. This prohibits interpretations like $\rho_{so}$. Thus, we strengthen invariant $I_{RB}(x)$ from (4.2) with the conjunct

$$(\forall c \in Client : (\forall i, j \in \mathbb{N} : (\forall y \in nbrs(x) : \{c.i.x.y,\, c.j.x.y\} \subseteq dom(\rho) \wedge i < j \Rightarrow \rho(c.i.x.y) \in \{0, 1\} \wedge \rho(c.j.x.y) \le \rho(c.i.x.y)))), \qquad (4.6)$$

where the set of clients is $Client = \{C_1, \ldots, C_n\}$. To continue again the example in which client $C_1$ broadcasts $X_0$ and $X_1$, this analysis yields the run in Figure 4.4. The inputs to $C_2$ and $C_3$ are the same as in (4.4), except that the poset is totally-ordered and the interpretations of the variables are restricted by the invariant (4.6). Details of this approach appear in Section 4.1.4.
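The effect of the invariant can be illustrated by enumerating the 0/1 interpretations it admits for the two-message example and checking the multiplicities of (4.4) under each. In the Python sketch below, the names `NAMES`, `satisfies_invariant`, `multiplicity`, and `admitted` are hypothetical, and only the ordering conjunct of (4.6) for messages 0 and 1 of client $C_1$ is modeled.

```python
from itertools import product

NAMES = ("C1.0.S1.S2", "C1.1.S1.S2", "C1.0.S1.S3", "C1.1.S1.S3")


def satisfies_invariant(rho):
    """Ordering part of (4.6) for messages 0 and 1: for each neighbor y,
    the variable for the later message is <= that for the earlier one."""
    return all(rho[f"C1.1.S1.{y}"] <= rho[f"C1.0.S1.{y}"] for y in ("S2", "S3"))


def multiplicity(rho, i):
    """Multiplicity of X_i in (4.4)."""
    return max(rho[f"C1.{i}.S1.S2"], rho[f"C1.{i}.S1.S3"])


def admitted():
    """Yield every 0/1 interpretation of NAMES that satisfies the invariant."""
    for bits in product((0, 1), repeat=len(NAMES)):
        rho = dict(zip(NAMES, bits))
        if satisfies_invariant(rho):
            yield rho
```

Under every admitted interpretation the multiplicity of $X_1$ is at most that of $X_0$, and interpretations like $\rho_{so}$ are excluded.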

4.1.3  Fault-Tolerance  Requirement 

The fault-tolerance requirement for FIFO reliable broadcast is

$$b(fs)(r) = (\text{the network is connected in } fs) \Rightarrow b_0(fs, r),$$

where $b_0(fs, r)$ is the conjunction of the following five predicates, which together express the assumptions (namely, Known-Sender and Uniqueness) and requirements described in Section 4.1.1. Predicates formalizing the assumptions are included here because the predicates formalizing the requirements depend on the assumptions; specifically, if the assumptions did not hold, the other predicates would not have the intended meaning. For each predicate, we indicate in parentheses the conditions from Section 4.1.1 to which that predicate roughly corresponds.

Figure 4.4: Behavior of the reliable broadcast protocol when $S_1$ crashes. In the figure, totally-ordered posets are written as sequences, the abstract value $MF(C_1)$ is elided, and $\ell_i = X_i : MF(C_1)^{\max(C_1.i.S_1.S_2,\, C_1.i.S_1.S_3)}$.

1. (Known-Sender) For each client $C_i$, every output ms-atom of $C_i$ contains a value of the form $\{v : MF(C_i)\}$ for some $v \in Var_M(C_i)$.

2. (Uniqueness) For each client $C_i$ and each $v \in Var_M(C_i)$, $v$ appears in at most one output ms-atom of $C_i$, and that ms-atom has multiplicity $\{s : 1\}$ for some $s \in SVal$.

3. (Integrity) Every input ms-atom of every client contains a value that occurs in some client's output.

4. (Agreement, Validity) For each client $C_i$ and each $v \in Var_M(C_i)$ that appears in $C_i$'s output:

(a) If $fs(S_i) = OK$, then for each client $C_j$ such that $fs(S_j) = OK$, $C_j$ delivers $v$ exactly once, i.e., $C_j$'s input contains exactly one ms-atom containing $v$, and that ms-atom is of the form $v : MF(C_i)^{s:1}$ for some $s \in SVal$.

(b) If $fs(S_i) = crash$, then there exists $s \in SVal$ such that for each client $C_j$ such that $fs(S_j) = OK$, $C_j$'s input contains exactly one ms-atom containing $v$, and that ms-atom is of the form $v : MF(C_i)^{s:?}$.

5. (FIFO Order) For each client $C_i$, each $(\ell_0, \ell_1) \in \pi_2(r(S_i)(C_i))$, and each client $C_j$ such that $fs(S_j) = OK$, $C_j$ does not deliver the message represented by $\ell_1$ unless it has previously delivered the message represented by $\ell_0$. More precisely, let $\ell'_0$ be the unique ms-atom in $r(C_j)(S_j)$ containing the same variable as $\ell_0$, and similarly for $\ell'_1$.4 We require that $(\ell'_0, \ell'_1) \in \pi_2(r(C_j)(S_j))$ and that the symbolic multiplicities $s_0$ and $s_1$ in $\ell'_0$ and $\ell'_1$, respectively, satisfy $s_1 \le_{sv} s_0$, where the relation $\le_{sv}$ on symbolic values captures the inequalities implied by the invariant (4.6) and by the meaning of min and max:

\[
\begin{array}{lll}
\multicolumn{3}{l}{s \le_{sv} s' = \text{match } (s, s') \text{ with}} \\
\mid (c.i.x.y,\ c'.i'.x'.y') & \to & c' = c \land i' \le i \land x' = x \land y' = y \\
\mid (\max(S),\ \max(S')) & \to & (\forall s \in S : (\exists s' \in S' : s \le_{sv} s')) \\
\mid (\min(S),\ \min(S')) & \to & (\forall s' \in S' : (\exists s \in S : s \le_{sv} s')) \\
\mid \_ & \to & false
\end{array}
\tag{4.7}
\]

4 Existence of $\ell'_0$ and $\ell'_1$ is guaranteed by the previous requirement.

4.1.4 Input-Output Functions

With the invariant defined by (4.2) and (4.6), server $S_i$ is represented by the input-output function $server(S_i, nbrs(S_i))$, where for $me \in Name$ and $nbrs \subseteq Name$, $server(me, nbrs)$, defined in Figure 4.5, works as follows. If the destination $x$ is not a neighbor, then the server sends no messages to $x$. Predicate noRepeats (defined below) checks, roughly, whether the input satisfies the Known-Sender and Uniqueness assumptions. If not, server just "gives up" and returns a poset representing arbitrary outputs. Otherwise, the server's outputs are computed as follows. For $n \in nbrs$, $mf(n)$ is just the set of "messages" (more precisely, values representing messages) received from $n$. Set $msgs$ is the set of all messages received by this server. Since the server relays every message, $msgs$ is also the set of messages that will appear in the server's outputs. Thus, the set $O$ of output ms-atoms is given by a union over $msgs$. In this union, function $mul_{RB}$ (defined below) is used to compute the multiplicity associated with each output message; this is done in the manner sketched in Section 4.1.2. Next, the order $\prec$ in which the server outputs the messages is computed using the following rule: if every neighbor that sent $v_2$ sent $v_1$ first (in other words, if $v_1$ precedes $v_2$ in the poset representing the input from that neighbor), then this server definitely receives $v_1$ before $v_2$, so $v_1 \prec v_2$. Predicate $precede(S, v_1, v_2)$ (defined below) checks whether value $v_1$ precedes value $v_2$ in poset $S$. Finally, the output poset is computed from $O$ and $\prec$. The following paragraphs give details of the auxiliary functions used in the definition of server.


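\[
\begin{array}{l}
server(me, nbrs) = \\
\quad (\lambda fail : \{OK, crash\}.\ (\lambda h : Hist_{FC}.\ (\lambda x : Name. \\
\qquad \text{if } x \notin nbrs \text{ then } (\emptyset, \emptyset) \\
\qquad \text{else if } \neg(\forall n \in nbrs : noRepeats(\pi_1(h(n)))) \text{ then } (\{(*, \top_V, \emptyset)\}, \emptyset) \\
\qquad \text{else let } mf = (\lambda n : nbrs.\ \pi_2(\pi_1(h(n)))) \\
\qquad\quad \text{in let } msgs = \bigcup_{n \in nbrs} mf(n) \\
\qquad\quad \text{in let } {\prec} = \{(v_1, v_2) \in msgs \times msgs \mid \\
\qquad\qquad\qquad (\forall n \in nbrs : v_2 \in mf(n) \Rightarrow v_1 \in mf(n) \land precede(h(n), v_1, v_2))\} \\
\qquad\quad \text{in let } O = \bigcup_{v \in msgs} \{(mul_{RB}(nbrs, h, fail, v, me, x, msgs, \prec),\ v,\ \emptyset)\} \\
\qquad\quad \text{in } (O,\ \{(\ell_1, \ell_2) \in O \times O \mid \pi_2(\ell_1) \prec \pi_2(\ell_2)\}))))
\end{array}
\]

Figure 4.5: Definition of server.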
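The ordering rule used in server can be illustrated at a simplified, concrete level. The sketch below is not the abstract function of Figure 4.5: it assumes each neighbor's input is a plain list of distinct values (a totally ordered input), ignores multiplicities entirely, and uses the hypothetical name `output_order`.

```python
def output_order(inputs):
    """Compute the relayed messages and the order the server can guarantee.

    inputs maps each neighbor name to the list of distinct values received
    from it, in arrival order. Returns (msgs, order), where (v1, v2) is in
    order when every neighbor that sent v2 sent v1 earlier.
    """
    msgs = {v for seq in inputs.values() for v in seq}
    order = set()
    for v1 in msgs:
        for v2 in msgs:
            if v1 == v2:
                continue
            # Only neighbors that sent v2 constrain the pair (v1, v2).
            if all(
                v1 in seq and seq.index(v1) < seq.index(v2)
                for seq in inputs.values()
                if v2 in seq
            ):
                order.add((v1, v2))
    return msgs, order
```

For example, if one neighbor delivers X0 then X1 and another delivers only X0, the server can guarantee that X0 is relayed before X1; if two neighbors deliver the same two values in opposite orders, no order is guaranteed.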

Predicate $noRepeats(S)$ checks that each message in $S \in POSet(L)$ has the required format and is sent at most once:

\[
\begin{array}{l}
noRepeats(S) = (\forall \ell \in S : (\exists c \in Name : (\exists v \in Var_M(c) : \\
\qquad \pi_2(\ell) = \{v : MF(c)\} \land \pi_2(\pi_1(\ell)) \subseteq \{1, ?\} \\
\qquad \land\ (\forall \ell' \in S \setminus \{\ell\} : v \notin \pi_1(\pi_2(\ell')))))).
\end{array}
\tag{4.8}
\]

Predicate $precede(S, v_1, v_2)$ checks whether $v_1 \in Val$ is definitely sent (and hence received) before $v_2 \in Val$ in the sequences of messages represented by $S \in POSet(L)$:

\[
\begin{array}{l}
precede(S, v_1, v_2) = \\
\quad (\forall \ell_1, \ell_2 \in \pi_1(S) : (\pi_2(\ell_1) = v_1 \land \pi_2(\ell_2) = v_2) \Rightarrow \\
\qquad (\ell_1, \ell_2) \in \pi_2(S) \land \pi_1(\pi_1(\ell_2)) \le_{set(sv)} \pi_1(\pi_1(\ell_1))),
\end{array}
\tag{4.9}
\]

where the extension of $\le_{sv}$ to sets of symbolic values is

\[ S \le_{set(sv)} S' = (\forall s \in S : (\forall s' \in S' : s \le_{sv} s')). \tag{4.10} \]
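The relation on symbolic values from (4.7) and its extension to sets in (4.10) can be sketched in Python as follows. The tagged-tuple encoding of symbolic values is an assumption made for this example, and the index comparison is read as non-strict, which is also an assumption about the intended definition.

```python
def leq_sv(s, t):
    """Sound (incomplete) test that s <= t holds under the invariant (4.6).

    Symbolic values are encoded (hypothetically) as
      ("var", c, i, x, y)   for a variable c.i.x.y,
      ("max", (s1, ...))    for max over symbolic values,
      ("min", (s1, ...))    for min over symbolic values.
    """
    if s[0] == "var" and t[0] == "var":
        _, c, i, x, y = s
        _, c2, i2, x2, y2 = t
        # A later message (larger index) is sent only if earlier ones were.
        return c2 == c and i2 <= i and x2 == x and y2 == y
    if s[0] == "max" and t[0] == "max":
        return all(any(leq_sv(a, b) for b in t[1]) for a in s[1])
    if s[0] == "min" and t[0] == "min":
        return all(any(leq_sv(a, b) for a in s[1]) for b in t[1])
    return False


def leq_set(S, T):
    """Extension to sets: every element of S is below every element of T."""
    return all(leq_sv(s, t) for s in S for t in T)
```

The final `return False` mirrors the catch-all case of the definition: the relation is a conservative approximation, so pairs it cannot justify are simply not related.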

Multiplicity $mul_{RB}(nbrs, h, fail, v, me, x, msgs, \prec)$, defined in Figure 4.6, is the multiplicity with which server $me \in Name$ with neighbors $nbrs \subseteq Name$ sends message $v \in Val$ to neighbor $x \in Name$, given inputs $h \in Hist$ and failure $fail \in \{OK, crash\}$, and with $msgs$ and $\prec$ as in the definition of server. It is computed as follows. First, the multiplicity with which the server received value $v$ is computed; the abstract and symbolic parts of this multiplicity are $a_{rcv}$ and $s_{rcv}$, respectively. If $fail = OK$, then the multiplicity with which $v$ is received is also the multiplicity with which $v$ is relayed by the server. Otherwise (i.e., if the server crashes), the symbolic multiplicity with which $v$ is relayed is the minimum of $s_{rcv}$ and a variable that indicates whether the server crashed before sending $v$. Let $c$ be the client that sent $v$. If the inputs from $c$ are totally ordered, then a variable of the form $c.i.me.x$ is used, as described in Section 4.1.2; otherwise, there is no easy way to associate an index $i$ with the message,5 so a "fresh" variable (in particular, a variable not of that form, whose value is therefore not constrained by (4.6)) is used instead. Function $freshvar(c, v, me, x) \in Var(me)$ is assumed to return such a variable.

Other functions used in the definition of $mul_{RB}$ are as follows. Recall that $apply$ was defined following (2.53). For a poset $p$, $linearize(p)$ returns a sequence that is some linearization of $p$; in the definition of $mul_{RB}$, it doesn't matter which linearization is returned, because max is commutative. For a value $v$, $getClnt(v)$ is $c$ if $\pi_2(v) = \{MF(c)\}$ and is undefined otherwise. Note that the check of noRepeats in server ensures that the application of getClnt in $mul_{RB}$ will be defined. Similarly, $getSym(v)$ is $s$ if $\pi_1(v) = \{s\}$ and is undefined otherwise. For a totally-ordered poset $p$, $getIndex(x, p)$ returns the least $i \in \mathbb{N}$ such that $p[i] = x$ (where $p$ is regarded as a sequence), if it exists, and is undefined otherwise.

5 The index set could be generalized to be some partial order $(S, \prec_S)$, instead of the natural numbers. The invariant could be extended to require that variables of the form $c.s.x.y$, where $s \in S$, satisfy inequalities corresponding to the partial ordering $\prec_S$. This would give a more precise analysis in cases where the output posets of clients are not totally-ordered.

\[
\begin{array}{l}
mul_{RB}(nbrs, h, fail, v, me, x, msgs, \prec) = \\
\quad \text{let } a_{rcv} = \text{if } (\exists n \in nbrs : (\exists \ell \in \pi_1(h(n)) : \pi_2(\ell) = v \land 1 \in \pi_2(\pi_1(\ell)))) \text{ then } 1 \\
\qquad\qquad\quad \text{else } ? \\
\quad \text{in let } s_{rcv} = \text{let } S = \bigcup_{n \in nbrs} (\text{let } S_1 = \{\ell \in \pi_1(h(n)) \mid \pi_2(\ell) = v\} \text{ in } \pi_1(\pi_1(S_1))) \\
\qquad\qquad\qquad \text{in } apply(\max, linearize((S, \emptyset))) \\
\quad \text{in if } fail = OK \text{ then } \{(simplify(s_{rcv}), a_{rcv})\} \\
\qquad \text{else let } c = getClnt(v) \\
\qquad\quad \text{in let } p = (\{v' \in msgs \mid \pi_2(v') = \{MF(c)\}\}, \prec) \\
\qquad\quad \text{in let } s_F = \text{if } totalOrd(p) \text{ then let } i = getIndex(v, p) \text{ in } c.i.me.x \\
\qquad\qquad\qquad\qquad \text{else } freshvar(c, v, me, x) \\
\qquad\quad \text{in } \{(simplify(apply(\min, \langle s_F, s_{rcv} \rangle)),\ ?)\}
\end{array}
\]

Figure 4.6: Definition of $mul_{RB}$.

A simplification routine $simplify \in SVal \to SVal$ is used to simplify expressions involving max and min. The invariant (4.6) justifies assuming during the simplification that each variable of the form $c.i.x.y$ represents a value in $\{0, 1\}$. Thus, powerful Boolean simplification procedures can be used. In particular, putting the expressions in some canonical form (e.g., disjunctive normal form, or ordered binary decision diagrams) helps ensure termination of the fixed-point iteration. For examples with single failures, the analysis terminates even if simplify does only trivial simplifications (e.g., $simplify(\max(\max(m, n), m)) = \max(m, n)$). Additional simplifications are needed to analyze failure scenarios involving multiple crashes, because the symbolic multiplicities have a more complicated structure. In particular, for single crashes, they have the form $\max(\cdots)$, as in Figure 4.3; for double crashes, they have the form $\min(\cdots, \max(\cdots), \cdots)$, as in Figure 4.7; for triple crashes, they have the form $\min(\cdots, \max(\cdots, \min(\cdots), \cdots), \cdots)$; and so on. Thus, simplifications involving combinations of min and max are needed.
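The "trivial" level of simplification mentioned above can be sketched as follows. This is only an illustration under assumed encodings (variables as strings, compound expressions as `(op, args)` pairs); it flattens nested applications of the same operator and removes duplicate arguments, and it does not attempt the Boolean canonical forms that multiple crashes call for.

```python
def simplify(expr):
    """Flatten nested max/min of the same kind and drop duplicate arguments.

    expr is either a variable name (str) or a pair (op, args) with op in
    {"max", "min"} and args a sequence of expressions. Hypothetical encoding.
    """
    if isinstance(expr, str):
        return expr
    op, args = expr
    flat = []
    for arg in map(simplify, args):
        # max(max(a, b), c) becomes max(a, b, c); likewise for min.
        nested = not isinstance(arg, str) and arg[0] == op
        for item in (arg[1] if nested else [arg]):
            if item not in flat:
                flat.append(item)
    # A max or min over a single argument is just that argument.
    return flat[0] if len(flat) == 1 else (op, tuple(flat))
```

Keeping arguments in first-occurrence order gives a stable result for repeated inputs, which is the property that matters for detecting that a fixed-point iteration has stopped changing; a real implementation would likely sort or hash-cons the arguments instead.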

In  principle,  this  input-output  function  server  representing  a  server  can  be  used  with  any 
input-output  function  representing  a  client,  though  for  the  analysis  to  be  useful,  the  latter 
should  produce  outputs  satisfying  the  Known-Sender  and  Uniqueness  conditions. 

4.1.5  Examples 

An example of the analysis for a failure scenario involving a single crash appears in Figure 4.4.

For an example involving two crashes, consider a system with $n = 4$, and consider the failure scenario in which $S_1$ and $S_2$ crash. The result of the analysis is shown in Figure 4.7. Note that this run satisfies the fault-tolerance requirement.

Figure 4.7: Behavior of the reliable broadcast protocol when $S_1$ and $S_2$ crash.


4.2  Byzantine  Agreement 

Now consider a system comprising a commander $C$, lieutenants $L_1, \ldots, L_n$, and armies $A_1, \ldots, A_n$. The goal of a Byzantine Agreement protocol is for the commander to disseminate a command to the lieutenants, so they can then act on this command. A lieutenant $L_i$ acts on (or "decides on", in the terminology of [LSP82]) a command by sending a message containing that command to its army $A_i$. Thus, the set of components in the system is

\[ Name = \{C\} \cup Ltnts \cup \Big( \bigcup_{i \in [1..n]} \{A_i\} \Big) \tag{4.11} \]

where

\[ Ltnts = \bigcup_{i \in [1..n]} \{L_i\}, \qquad [i..j] = \{k \in \mathbb{N} \mid i \le k \le j\}. \]

The  Oral  Messages  algorithm  of  [LSP82]  to  solve  this  problem  works  under  the  following 
assumptions  about  the  underlying  communication  mechanism: 

Integrity:  Every  message  that  is  sent  is  delivered  correctly. 

Known-Sender:  The  receiver  of  a  message  knows  who  sent  it. 

Missing-Message-Detection:  The  absence  of  a  message  can  be  detected. 

The first two assumptions are built into our framework: Integrity, because our definition of step never removes messages from a run; Known-Sender, because histories classify messages by sender, and input-output functions take histories as arguments. Missing-Message-Detection is not satisfied by our framework but can be encoded using standard techniques [Bro90,BD92]. In particular, we adopt the following convention: when a component in reality omits to send a message, this omission is modeled by sending a distinguished value tmout ("timeout"). By this convention, receiving tmout corresponds to detecting absence of a message. This convention is used at both the concrete and abstract levels, so we treat tmout as both a concrete value and an abstract value, with $\llbracket tmout \rrbracket_{AVal} = \{tmout\}$.

The basic Oral Messages algorithm of [LSP82] further assumes that the commander and lieutenants can communicate with each other directly. We associate with each component $x \in Name$ the set $nbrs(x) \subseteq Name$ of its neighbors, i.e., the set of components with which it can communicate directly. We take:

\[
\begin{array}{l}
nbrs(C) = Ltnts \\
nbrs(L_i) = \{C, A_i\} \cup Ltnts \setminus \{L_i\} \\
nbrs(A_i) = \{L_i\}.
\end{array}
\]
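The neighbor sets above can be written out directly. The following Python sketch builds them for a given n, using string names such as `"L1"` and `"A1"` that are chosen only for this illustration.

```python
def neighbors(n):
    """Build nbrs for a commander C, lieutenants L1..Ln and armies A1..An."""
    ltnts = {f"L{i}" for i in range(1, n + 1)}
    nbrs = {"C": set(ltnts)}
    for i in range(1, n + 1):
        # A lieutenant talks to the commander, its own army, and its peers.
        nbrs[f"L{i}"] = ({"C", f"A{i}"} | ltnts) - {f"L{i}"}
        # An army talks only to its lieutenant.
        nbrs[f"A{i}"] = {f"L{i}"}
    return nbrs
```

A quick sanity check on the result is that the relation is symmetric: whenever y is a neighbor of x, x is a neighbor of y.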

Fault-Tolerance Requirement. We consider Byzantine failures of the commander and lieutenants. A Byzantine-faulty component may send an arbitrary number of arbitrary values to each of its neighbors. A Byzantine failure is represented by failure $ByzFail \in Fail$. The fault-tolerance requirement is

\[ b(fs)(r) = |\{x \in Name \mid fs(x) \ne OK\}| \le \lfloor n/3 \rfloor \Rightarrow b_0(fs, r), \]

where $b_0(fs, r)$ is the conjunction of the following conditions:

1.  If  the  commander  is  non-faulty  (i.e.,  fs(C )  =  OK ),  then  the  inputs  of  the  armies  asso¬ 
ciated  with  non-faulty  lieutenants  are  unchanged  compared  to  the  failure-free  behavior, 
i.e.,  for  each  such  army  A,  unchanged(r(A)). 

2.  If  the  commander  is  faulty  (i.e.,  fs(C')  =  ByzFail ),  then  all  armies  associated  with  non- 
faulty  lieutenants  receive  the  same  value,  i.e.,  for  some  s  G  SVclIq ,  for  each  such  army 
A.  r(A )  is  a  singleton  poset  {£},  and  l  has  original  multiplicity  “one”,  unperturbed 
multiplicity,  and  perturbed  value  s: 

A  =  {1}  A  unchanged Val(TTi(i),TTs(i))  ATifa^i))  =  {s-}.  (4.12) 

This  specification  has  a  pleasant  locality  property:  it  refers  only  to  the  inputs  of  the  armies. 
In  a  framework  without  explicit  perturbations,  the  specification  would  have  to  involve  the 
commander’s  inputs  and  outputs  as  well. 

The  Oral  Messages  algorithm  of  [LSP82]  is  essentially  a  recursive  application  of  majority 
voting.  The  interesting  aspects  of  its  behavior  can  already  be  seen  in  the  case  n  =  3,  which 
exhibits  a  single  level  of  recursion  plus  the  base  case.  So,  we  take  n  =  3  in  the  detailed  part 
of  the  exposition  and  then  sketch  the  extension  to  arbitrary  n.  The  algorithm  is  described 
in  Section  4.2.1.  Section  4.2.2  shows  the  results  of  the  analysis. 

4.2.1  Oral  Messages  Algorithm 

We describe the algorithm informally before formalizing it as input-output functions. The commander sends a command, represented by variable $X \in Var(C)$, to each lieutenant. Each lieutenant forwards the value it receives from the commander to the other lieutenants. When a lieutenant has received values from the commander and all of the other lieutenants, it takes the majority of those values using a majority function $\varphi_{maj}$ (cf. Section 2.1.3) and sends the result to its army. More precisely, $L_i$ computes $\varphi_{maj}(cv_1, cv_2, cv_3)$, where $cv_i$ is the value that $L_i$ received from the commander and, for $j \ne i$, $cv_j$ is the value received from $L_j$. If any of the received values is tmout, then some default value is used in its place. The run in Figure 4.8 represents the failure-free behavior of this algorithm.
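The informal description can be exercised with a small concrete simulation. The sketch below is not the abstract analysis; it simply plays out one round with concrete values. The names `oral_messages`, `DEFAULT` and `relay_override`, the choice of default value, and the fallback to the default when no strict majority exists are all assumptions made for this example.

```python
from collections import Counter

TMOUT = "tmout"
DEFAULT = "retreat"  # hypothetical default used in place of tmout


def majority(values):
    """Majority of the values, with tmout replaced by the default."""
    vals = [DEFAULT if v == TMOUT else v for v in values]
    value, count = Counter(vals).most_common(1)[0]
    # Assumption: without a strict majority, fall back to the default.
    return value if count > len(vals) / 2 else DEFAULT


def oral_messages(n, commander_sends, relay_override=None):
    """Return each lieutenant's decision after one relay round.

    commander_sends[i] is the value the commander sends to lieutenant i.
    relay_override[(j, i)], if present, is what a faulty lieutenant j sends
    to lieutenant i instead of the value it received.
    """
    relay_override = relay_override or {}
    decisions = {}
    for i in range(1, n + 1):
        votes = []
        for j in range(1, n + 1):
            if j == i:
                votes.append(commander_sends[i])
            else:
                votes.append(relay_override.get((j, i), commander_sends[j]))
        decisions[i] = majority(votes)
    return decisions
```

With a faulty commander that sends different values to different lieutenants, every lieutenant votes over the same sequence of values, so all decisions coincide; with one faulty lieutenant, the two correct values outvote the bad one.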
Figure 4.8: Failure-free behavior of the Oral Messages algorithm.

Definition of Cmdr. The input-output function representing the commander is $Cmdr(Ltnts)$, where for $dests \in Set(Name)$,

\[
\begin{array}{l}
Cmdr(dests) = (\lambda fail : \{OK, ByzFail\}.\ (\lambda h : Hist.\ (\lambda x : Name. \\
\qquad \text{if } x \notin dests \text{ then } (\emptyset, \emptyset) \\
\qquad \text{else if } fail = OK \text{ then } (1,\ X : N,\ id,\ id,\ \emptyset) \\
\qquad \text{else } (1,\ X : N,\ \top_{\Delta V},\ +_\Delta,\ \emptyset)))).
\end{array}
\tag{4.13}
\]

The use of $+_\Delta$ rather than $*_\Delta$ is justified by the convention for modeling Missing-Message-Detection: even a faulty process must send at least a timeout.

Definition  of  Ltnt.  The  input-output  function  representing  a  lieutenant  is  composed  of 
two  main  pieces:  one  that  relays  the  value  received  from  the  commander,  and  one  that 
handles  voting. 

Relaying is captured by the function relay defined in Figure 4.9, where

\[ arbnew = (\{\top_V^{*}\}, \emptyset). \tag{4.14} \]

For $fail \in \{OK, ByzFail\}$, $S \in POSet(L_{FC})$, $var \in Var$, and $aval \in AVal$, the original part of the output of $relay(fail, S, var, aval)$ is determined as follows. If there are no inputs, then there are no outputs. If $S$ is a singleton containing a value of the form $\{s : aval\}$, then that value is relayed (i.e., included in the output). If neither of those cases applies, then the output is $\{var : aval\}$. At the concrete level, this third case may correspond to receiving a timeout (or other value not in $aval$) and relaying some default value, or to receiving multiple values in $aval$ from the (faulty) commander and relaying the first of them. We don't need to distinguish these two possibilities at the abstract level, because they both arise only if the commander is faulty, in which case it doesn't matter what value is relayed, provided the same value is relayed to all of the other lieutenants. The equality of the values relayed to different lieutenants is reflected by using a local variable $var$ (instead of just a wildcard) to represent the relayed value.
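The three cases have a direct concrete-level counterpart, sketched below. The function name `relay_concrete` and its parameters are hypothetical; the sketch only mirrors the case analysis described above and is not the abstract function of Figure 4.9.

```python
def relay_concrete(received, valid, default):
    """Choose the value a lieutenant relays to the other lieutenants.

    received: values received from the commander, in arrival order.
    valid: the set of well-formed command values (the concrete 'aval').
    default: value relayed when nothing usable was received.
    Returns None when there is no input, hence no output.
    """
    if not received:
        return None
    in_range = [v for v in received if v in valid]
    if len(received) == 1 and in_range:
        # Exactly one well-formed value: relay it unchanged.
        return received[0]
    # Timeout or malformed value: use the default. Several well-formed
    # values (possible only with a faulty commander): relay the first.
    return in_range[0] if in_range else default
```

Whatever is returned in the third case, the same value goes to every other lieutenant, which is the property the abstract definition records by using a single local variable.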

Perturbations to the relayed output are determined by the parameter $fail$ and by perturbations and new ms-atoms in the input $S$. For $fail = OK$, the perturbation to the output is determined roughly as follows: if the inputs are unchanged (i.e., the perturbation is $id$), then so are the outputs; if the inputs do change, then we conservatively assume the relayed value may change to any other value in $aval$, so we take the perturbation to the output to be $aval_\Delta$, whose meaning is given by (3.34).

For $fail = ByzFail$, perturbations to the output and new outputs are determined as follows. Arbitrary new outputs may be sent before the original output; these are represented by arbnew. The original output value may change arbitrarily, so we take the perturbation to be $\top_{\Delta V}$. Perturbations to the multiplicities are determined by $rly\Delta Mul(fail, S)$, whose definition reflects the convention for modeling Missing-Message-Detection: if the lieutenant definitely receives a message from the commander, then it definitely relays some value (possibly just tmout). This is a special case of Broy and Dendorfer's time progress property [BD92].

For $cmdr \in Name$, $ltnts \in Seq(Name)$, $me \in Name$, $var \in Var$, and $army \in Name$, input-output function $Ltnt(cmdr, ltnts, me, var, army)$ represents a lieutenant named $me$ whose army is $army$, in a system in which the commander is $cmdr$ and the sequence of lieutenants is $ltnts$; also, when this lieutenant's input from the commander contains multiple symbolic values, this lieutenant uses $var$ to represent the value received from the commander and relayed to the other lieutenants. This input-output function works as follows. Function relay is used to determine the value received from the commander and relayed to the other lieutenants; this value is contained in the ms-atom in the singleton poset returned by relay.
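\[
\begin{array}{l}
relay(fail, S, var, aval) = \\
\quad \text{match } \pi_1(S) \cap L_{per} \text{ with} \\
\quad \mid \emptyset \to \text{if } fail = OK \land (\pi_1(S) \cap L_{new}) = \emptyset \text{ then } (\emptyset, \emptyset) \\
\qquad\qquad \text{else } arbnew \\
\quad \mid \{(mul, val, \delta mul, \delta val, tag)\} \to \\
\qquad \text{let } val_1 = \text{match } val \text{ with} \\
\qquad\qquad\quad \mid \{s : a\} \to \text{if } a = aval \land s \ne \_ \text{ then } s : a \text{ else } var : aval \\
\qquad\qquad\quad \mid \_ \to var : aval \\
\qquad \text{in let } \delta val_1 = \text{if } fail = OK \text{ then} \\
\qquad\qquad\qquad\quad \text{if } (\pi_1(S) \cap L_{new}) = \emptyset \land \pi_2(\delta val) = \{id\} \text{ then } \{\pi_1(val_1) : id\} \\
\qquad\qquad\qquad\quad \text{else } \{\_ : aval_\Delta\} \\
\qquad\qquad\qquad \text{else } \{\_ : \top_{\Delta V}\} \\
\qquad \text{in } (\{(rlyMul(S),\ \{val_1\},\ rly\Delta Mul(fail, S),\ \delta val_1,\ tag)\}, \emptyset) \\
\quad \mid \_ \to \text{let } \delta val = \text{if } fail = OK \text{ then } aval_\Delta \text{ else } \top_{\Delta V} \\
\qquad\qquad \text{in } (\{(rlyMul(S),\ var : aval,\ rly\Delta Mul(fail, S),\ \delta val,\ \emptyset)\}, \emptyset)
\end{array}
\]

\[ rlyMul(S) = \text{if } (\exists \ell \in \pi_1(S) : definite(\pi_1(\ell))) \text{ then } \{\_ : 1\} \text{ else } \{\_ : ?\} \]

\[
\begin{array}{l}
rly\Delta Mul(fail, S) = \text{if } (\forall \ell \in \pi_1(S) : definite(\pi_1(\ell)) \land \pi_2(\pi_3(\ell)) \subseteq \{id, +_\Delta\}) \text{ then} \\
\qquad\qquad \text{if } fail = OK \text{ then } \{\_ : id\} \text{ else } \{\_ : +_\Delta\} \\
\qquad \text{else if } fail = OK \text{ then } \{\_ : ?_\Delta\} \text{ else } \{\_ : *_\Delta\}
\end{array}
\]

Figure 4.9: Definition of relay, with two auxiliary functions.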
Input-output function $Voter_{FC}$, defined by (3.53), is used to handle voting, with the result of relay (representing the value received from the commander) being used as the vote of this lieutenant. The output is then determined by considering the destination. If the lieutenant is non-faulty, it sends its decision (that is, the outcome of the vote) to its army. To other lieutenants, the result of relay is always sent. A non-faulty lieutenant sends nothing to the commander, but a faulty lieutenant sends arbitrary messages to the commander.

The input-output function for lieutenant $L_i$ is $Ltnt(C, \langle L_j \rangle_{j \in [1..n]}, L_i, X_i, A_i)$, where $X_i \in Var(L_i)$ and

\[
\begin{array}{l}
Ltnt(cmdr, ltnts, me, var, army) = \\
\quad (\lambda fail : \{OK, ByzFail\}.\ (\lambda h : Hist.\ (\lambda x : Name. \\
\qquad \text{let } relay_0 = relay(fail, h(cmdr), var, N) \\
\qquad \text{in let } decis = Voter_{FC}(ltnts, army, N)(OK)(h \oplus (\lambda x : \{me\}.\ relay_0))(army) \\
\qquad \text{in if } x = army \text{ then} \\
\qquad\qquad \text{if } fail = OK \text{ then } decis \\
\qquad\qquad \text{else if } \pi_1(decis) = \emptyset \text{ then } arbnew \\
\qquad\qquad \text{else } arbchng(decis) \\
\qquad\quad \text{else if } x \in ltnts \setminus \{me\} \text{ then } relay_0 \\
\qquad\quad \text{else if } x = cmdr \land fail \ne OK \text{ then } arbnew \\
\qquad\quad \text{else } (\emptyset, \emptyset)))),
\end{array}
\tag{4.15}
\]

where for $\ell \in L_{FC}$,

\[
\begin{array}{lll}
\multicolumn{3}{l}{arbchng(\ell) = \text{match } \ell \text{ with}} \\
\mid (mul, val, tag) & \to & (*,\ \top_V,\ tag) \\
\mid (mul, val, \delta mul, \delta val, tag) & \to & (mul,\ val,\ \top_{\Delta V},\ +_\Delta,\ tag)
\end{array}
\tag{4.16}
\]

and arbchng on posets is the pointwise extension of arbchng from ms-atoms to $POSet(L_{FC})$ (as usual, retagging may be needed in the extension). Recall that $\oplus$ is defined following (2.78).


Definition of Army. The input-output function Army for an army is equal to $Act_{FC}$, defined in (3.48).


Extending the Definitions to Arbitrary n. For $n > 3$, the Oral Messages algorithm proceeds by recursion. Byzantine Agreements are performed among smaller and smaller sets of lieutenants, until a base case (a singleton set) is reached. The results of the recursive invocations of Byzantine Agreement are repeatedly combined using majority functions (of appropriate arity) to determine the result of the top-level Byzantine Agreement. To prevent confusion between messages generated by different recursive calls to the Byzantine Agreement protocol, each message is tagged with an identifier indicating which invocation of the protocol it belongs to. A simple scheme for choosing these identifiers is described in [LSP82].

Extending the definitions in this section to arbitrary $n$ involves little more than the additional bookkeeping needed to keep track of these identifiers. There is one caveat: if a message with uncertain identifier is received from a lieutenant (e.g., if the value is $\top_V$), then as a conservative approximation, we assume this might confuse determination of the values received from that lieutenant in all invocations of the protocol, so we use $\top_V$ for all those values.

4.2.2  Analysis  of  Perturbed  Behavior 

For  n  =  3,  the  algorithm  is  required  to  tolerate  a  Byzantine  failure  of  any  one  component.  We 
consider  first  the  effects  of  a  lieutenant  failure  and  then  turn  to  the  effects  of  a  commander 
failure. 

Lieutenant Failure. Let $nf_{BA}$ be the obvious mapping from Name to $Process_{FC}$ for this example. Let $fs_L$ be the failure scenario in which only $L_2$ is faulty. Figure 4.10 shows the run $run_{FC}(nf_{BA})(fs_L)$. To reduce clutter, where the posets on the edges in both directions between a pair of components are non-empty (e.g., for components $L_1$ and $L_2$), those two edges together are drawn as a single two-headed arrow, and each poset is positioned closer to its source (e.g., the poset drawn nearer to $L_1$ is on edge $(L_1, L_2)$). The non-faulty lieutenants receive two unchanged values and one changed value. Voting masks the changed input, leaving the non-faulty lieutenants' outputs unchanged. It is easy to check that this run satisfies the fault-tolerance requirement. The analysis of failure of $L_1$ or $L_3$ is a symmetric variant of this analysis.

Comments on the Definition of Ltnt. A faulty lieutenant may send arbitrary values at any time, even before it receives or sends messages in the non-faulty execution. In the run $step_F(nf_{BA}, fs_L)(\bot_{Run})$ shown in Figure 4.11, the ms-atoms on the outgoing edges of $L_2$ represent those outputs; note that these ms-atoms are equal to arbnew, defined by (4.14). In the fixed-point shown in Figure 4.10, some of these ms-atoms have been "replaced" by ms-atoms in $L_{per}$ containing arbitrary changes; in other words, messages once represented by the former are later represented by the latter. The omission of those occurrences of arbnew when the lieutenant receives inputs is a design decision; it would also have been correct to define relay so it retains those ms-atoms. The freedom to do either (and still have a monotonic input-output function) reflects the flexibility of our definition of the underlying ordering: it does not introduce an artificial separation between perturbed behavior and new behavior. It is easy to check that the run in Figure 4.10 satisfies the fault-tolerance requirement.

Figure 4.11: The run $step_F(nf_{BA}, fs_L)(\bot_{Run})$.


Commander Failure. Let $fs_C$ be the failure scenario in which only the commander is faulty. Figure 4.12 shows the run $run_{FC}(nf_{BA})(fs_C)$. Since each lieutenant $L_i$ relays the same value $X_i$ to the other two lieutenants, all three lieutenants apply the majority function to the same sequence of values when they compute their output. It is easy to check that the run in Figure 4.12 satisfies the fault-tolerance requirement.


Chapter  5 


Fault-Tolerance  for  Moving  Agents 


An  interesting  paradigm  for  programming  distributed  systems  is  moving  agents.  In  this 
paradigm,  an  agent  is  not  tied  to  a  particular  site  but  rather  moves  from  site  to  site  in  a 
network.  For  example,  an  agent  that  starts  at  site  S  might  move  to  site  Si  in  order  to  access 
some  service  (e.g.,  a  database)  available  there.  The  agent  might  then  determine  that  it  needs 
to  access  a  service  located  at  site  S2  and  move  there.  If  the  agent  has  gathered  all  of  the 
information  it  needs,  it  might  finish  by  moving  to  a  final  site  A  to  deliver  the  result  of  the 
computation  (A  may  be  the  same  as  S). 

For  our  current  purposes,  it  does  not  matter  whether  code  is  shipped  from  site  to  site;  the 
essential  points  are  that  the  thread  of  control  moves  from  site  to  site,  and  that  the  sequence 
of  services  used  by  a  moving  agent  is  generally  not  known  when  the  computation  starts, 
since  it  may  depend  on  information  obtained  as  the  computation  proceeds. 

There  are  two  fault-tolerance  issues:  protecting  moving  agents  from  faulty  servers  and 
protecting  servers  from  faulty  moving  agents.  This  chapter  examines  protocols  for  protecting 
moving  agents  from  servers  that  may  suffer  Byzantine  failures. 

5.1 Fault-tolerance for Moving Agents

To illustrate new problems that arise with moving agents, we consider a two-stage moving agent, analogous to the two-stage pipeline discussed in Chapters 2 and 3. By analogy with the replicated pipeline, one might try to make a two-stage moving agent fault-tolerant by having it access multiple replicas of each service, with a majority vote on the results after the final stage. This corresponds to the moving agent shown in Figure 5.1: it starts at a source $S$, accesses service $F$, which is replicated at sites $F_1, F_2, F_3$, then accesses service $G$, which is replicated at sites $G_1, G_2, G_3$.1 Since $G$ is the last service it needs, the agent moves to a consolidator $B$, which is responsible for delivering the result of the computation to the "actuator" $A$. Like a voter, the consolidator computes the majority of the values it receives and sends the result to the actuator; in addition, as discussed in detail below, the consolidator uses an authentication mechanism to determine which values are invalid and should be excluded from the vote.
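The consolidator's role can be sketched at a concrete level as a filter followed by a vote. In the sketch below, the `is_valid` predicate stands in for the authentication mechanism, whose details are not given at this point in the text; it, the message shape `(value, credential)`, and the choice to take the majority among the valid values only are assumptions made for illustration.

```python
from collections import Counter


def consolidate(messages, is_valid):
    """Filter out invalid values, then return the majority of the rest.

    messages: list of (value, credential) pairs received by the consolidator.
    is_valid: hypothetical authentication check on (value, credential).
    Returns None when no value has a strict majority among the valid ones.
    """
    valid = [value for value, cred in messages if is_valid(value, cred)]
    if not valid:
        return None
    value, count = Counter(valid).most_common(1)[0]
    return value if count > len(valid) / 2 else None
```

Excluding invalid values before voting means a forged message cannot contribute to a majority, although whether a given authentication scheme actually rules out every forgery is exactly the kind of question the analysis in this chapter is meant to answer.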


Figure  5.1:  Run  of  replicated  two-stage  moving  agent. 


A typical moving agent accesses only some of the available services. To reflect this, the system shown in Figure 5.1 includes a service $H$ comprising replicas $H_i$ not used by this particular agent. The fault-tolerance requirement for this system is that inputs to the actuator should be unaffected by Byzantine failure of a minority of the replicas of each service used by the moving agent and by Byzantine failure of all replicas of each service not used by the moving agent.

¹ Of course, $S$, $F_1$, etc., here denote different components than in previous examples: the mapping
from names to input-output functions is different.

Faulty components can spoof (i.e., send messages that appear to be from other components)
and eavesdrop (i.e., obtain copies of messages sent to other components). From the
perspective of the recipient of a message, the possibility of spoofing causes uncertainty about
the identity of the sender of the message, since the message might not be from the purported
sender. This uncertainty is modeled in our framework by using input-output functions that
are  independent  of  the  purported  names  of  the  senders  in  the  input  history.  A  simple  way 
to  ensure  this  is  to  use  only  input-output  functions  of  the  form 

\[ (\lambda\, fail : S.\ (\lambda\, h : Hist.\ f(fail, \textstyle\bigcup_{x \in Name} \pi_1(h(x))))), \]

where $S \subseteq Fail$ and $f \in (S \times Set(L)) \to Hist$. Since we hide the information that identifies
the  sender  of  each  message,  the  accuracy  of  this  information  is  irrelevant. 

Eavesdropping  is  modeled  in  a  similar  manner  to  Missing-Message-Detection  in  Section 
4.2.  A  faulty  component  (the  “eavesdropper”)  can  send  a  special  value  evsdrp.  The  output 
of  a  component  that  receives  this  value  must  contain  a  possibility  of  sending  copies  of  all 
subsequent outputs to the eavesdropper.² In examples, we assume a server is able to eavesdrop
on all components except actuators; actuators communicate only with consolidators.
This convention is used at both the concrete and abstract levels, so we treat evsdrp as both
a concrete value and an abstract value, with $[\![evsdrp]\!]_{AVal} = \{evsdrp\}$.

Consider  the  consolidator  B  in  Figure  5.1.  How  does  it  decide  which  inputs  are  valid  (i.e., 
should  be  included  in  majority  votes)?  One  might  be  tempted  to  say  that  the  consolidator 
should treat messages from $G_1$, $G_2$, and $G_3$ as valid and messages from other components as
invalid.  This  proposal  is  inappropriate  for  moving  agents,  because  it  assumes  the  consolidator 
knows  in  advance  that  the  last  service  visited  by  the  moving  agent  will  be  service  G — the 
sequence  of  services  visited  by  a  moving  agent  is  generally  not  known  in  advance. 

At  the  other  extreme,  suppose  the  consolidator  considers  all  inputs  valid.  Whenever 
the  consolidator  receives  the  same  value  from  a  majority  of  the  replicas  of  some  service,  it 
sends  that  value  to  the  actuator.  Due  to  the  possibility  of  spoofing,  checking  that  a  value 
was  received  from  different  replicas  of  the  same  service  requires  cryptographic  techniques, 
as  discussed  below.  This  scheme  tolerates  some  failure  scenarios  but  not  those  involving  the 
failure of services not used by the moving agent. For example, if $H_1$ alone fails, then with
this scheme, the consolidator would ignore any incorrect values that $H_1$ sends, since none of
the other replicas of service $H$ would send incorrect values. However, if $H_1$, $H_2$, and $H_3$ all fail and
send the same incorrect value to the consolidator, then with this scheme, the consolidator
would send that incorrect value to the actuator.
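
The weakness of this "accept everything" scheme can be illustrated with a small sketch. It assumes, purely for illustration, that sender identities have already been authenticated, so only the voting rule is modeled; the replica table and the name `naive_consolidate` are hypothetical and not part of the protocol description.

```python
from collections import Counter

# Hypothetical replica layout for the system of Figure 5.1.
REPLICAS = {
    "F": ["F1", "F2", "F3"],
    "G": ["G1", "G2", "G3"],
    "H": ["H1", "H2", "H3"],
}

def naive_consolidate(inputs):
    """Return every value reported by a majority of the replicas of some service.

    `inputs` maps a replica name to the value it sent to the consolidator.
    Sender authentication is assumed; only the voting rule is sketched.
    """
    accepted = set()
    for replicas in REPLICAS.values():
        counts = Counter(inputs[r] for r in replicas if r in inputs)
        for value, count in counts.items():
            if count > len(replicas) // 2:
                accepted.add(value)
    return accepted

good = {"G1": "G(F(X))", "G2": "G(F(X))", "G3": "G(F(X))"}
# A single faulty replica of the unused service H is outvoted ...
one_h_fails = naive_consolidate({**good, "H1": "bad"})
# ... but three colluding replicas of H form a majority of "some service".
all_h_fail = naive_consolidate({**good, "H1": "bad", "H2": "bad", "H3": "bad"})
```

In the second run the incorrect value is accepted alongside the correct one, which is exactly the scenario the scheme fails to tolerate.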

² For this purpose, we allow an exception to the rule in the previous paragraph; since evsdrp is not really
sent in messages, spoofing is irrelevant.

There are various ways to fix this problem. Informally, in a computation with failures,
a message is considered valid if it has visited the same sequence of services as a message
in the failure-free computation. We consider a protocol in which digital signatures [RSA78]
are used by the consolidator to reliably determine this.³ We assume digital signatures are
implemented  using  public-key  cryptography  and  that  each  component  knows  its  own  private 
key  and  the  public  key  of  every  other  component.  We  also  assume  each  component  knows 
which  service  is  provided  by  each  component. 

Each  message  is  signed — to  foil  spoofing  by  faulty  components — and  augmented  with 
information  about  the  sequence  of  services  that  should  have  been  visited  and  about  the 
sequence  of  services  actually  visited.  For  the  former,  each  source  and  server  include  in  each 
outgoing  message  the  next  service  to  be  visited  by  that  moving  agent.  More  specifically,  a 
source  includes  in  outgoing  messages  the  first  service  (or  consolidator,  if  no  services  are  being 
used)  to  be  visited  by  the  moving  agent,  and  a  server  includes  in  each  outgoing  message  the 
service  (or  consolidator)  to  be  visited  next  by  the  moving  agent  embodied  in  that  message. 
The consolidator will need to verify the entire “history” of the moving agent (i.e., the entire
sequence  of  visited  services),  so  a  server  x  also  includes  in  the  outgoing  message  one  of  the 
incoming  messages  that  embodied  the  arrival  of  that  moving  agent  at  x;  by  induction,  that 
incoming  message  contains  the  history  of  the  moving  agent  up  to  the  arrival  of  the  moving 
agent  at  x. 

Recall  that  sources  and  servers  sign  every  outgoing  message.  The  signatures  both  prevent 
lying  about  the  sequence  of  services  that  should  have  been  visited  (e.g.,  prevent  tampering 
with  the  input  message  included  in  the  output  message)  and  document  the  sequence  of 
services  actually  visited.  The  consolidator  tests  validity  of  a  message  by  checking  that  it 
was  originated  by  a  source,  that  the  consolidator  itself  is  the  declared  destination  of  the 
message,  and  that  the  sequence  of  declared  destinations  in  the  message  is  consistent  with 
the  signatures.  Of  course,  the  consolidator  also  checks  each  of  the  signatures  and  considers 
the  message  invalid  if  any  of  those  checks  fail.  We  say  a  set  of  messages  is  valid  if  each 
message  is  valid,  all  the  messages  visited  the  same  sequence  of  services  and  contain  the  same 
data  value,  and  the  messages  collectively  are  last  signed  by  a  majority  of  the  replicas  of  some 
service.  When  the  consolidator  receives  a  valid  set  of  messages,  it  forwards  the  common  data 
value  to  the  actuator. 

³ A protocol based on shared secrets is described in [MvRSS96].

We now describe this protocol in more detail. A message sent by a source with private
key $k$ has the form

\[ sign((data, dest), k) \tag{5.1} \]

where $data$ is the data carried by the message, $dest$ is the first service or consolidator to be
visited, and $sign(x, k)$ represents $x$ digitally signed with key $k$. For convenience, we assume
each  source  initiates  at  most  one  moving  agent;  this  restriction  is  easily  removed  by  including 
(say)  a  sequence  number  in  data  of  (5.1).  A  message  received,  processed,  and  forwarded  by 
a  server  x  with  private  key  k  has  the  form 

\[ sign((data, dest, m), k) \tag{5.2} \]

where $data$ and $dest$ are as before, and $m$ is the original message received by $x$ that caused
it  to  send  this  message. 
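
The nesting of forms (5.1) and (5.2) can be sketched symbolically. The following is a minimal illustration, not an implementation of the protocol: the class `Signed` and the helper `history` are hypothetical names, and a signature is represented only by recording the signer's name, with no real cryptography.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Signed:
    """Symbolic stand-in for sign((data, dest[, inner]), key)."""
    key: str                          # signer's name, standing in for its private key
    data: str
    dest: str                         # declared next service or consolidator
    inner: Optional["Signed"] = None  # embedded incoming message, absent in form (5.1)

def msg0(key, data, dest):
    """Message of form (5.1), sent by a source."""
    return Signed(key, data, dest)

def msg(key, data, dest, inner):
    """Message of form (5.2), sent by a server that received `inner`."""
    return Signed(key, data, dest, inner)

def history(m):
    """Signers of a message, from the source outward."""
    signers = []
    while m is not None:
        signers.append(m.key)
        m = m.inner
    return signers[::-1]

m0 = msg0("S", "X", "F")
m1 = msg("F1", "F(X)", "G", m0)
m2 = msg("G2", "G(F(X))", "B", m1)
```

Because each outgoing message embeds the incoming one, `history(m2)` recovers the whole sequence of signers by walking inward, which is the induction argument made above.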

To  conveniently  describe  protocols  that  send  messages  of  these  forms  in  our  framework, 
we introduce some notation. Let $Data \in AVal$ represent the data values carried by moving
agents, and let $Key \subseteq CVal$ represent the cryptographic keys used for digital signatures. For
$cv \in CVal$ and $k \in Key$, the concrete value $sign(cv, k)$ is $cv$ signed with key $k$.

Let $Src \subseteq Name$ be the set of sources, i.e., the unreplicated components trusted to
initiate moving agents. Let $Svc \subseteq Con$ be the set of (names of) services. We require that
different elements of $Svc$ denote different services at the concrete level; this is similar to
the Uniqueness requirement for messages in the reliable broadcast example in Section 4.1.1.
Formally, this requirement means that we consider only partial interpretations of constants
satisfying

\[ (\forall svc, svc' \in Svc : svc \neq svc' \Rightarrow p_a(svc) \neq p_a(svc')). \tag{5.3} \]

For convenience, we assume each server offers a single service. Thus, there is a function
$provides \in Name \to (Svc \cup \{\bot\})$ that returns the service provided by a component; it
returns $\bot$ for components (such as sources and actuators) that don’t provide a service.
Each  consolidator  offers  its  own  unique  consolidating  service;  that  is,  associated  with  each 
consolidator  x  is  some  service  provided  only  by  x. 

Messages of the forms (5.1) and (5.2) are represented by an abstract value $Msg \in AVal$,
whose meaning is the smallest set satisfying

\[
\begin{aligned}
[\![Msg]\!]_{AVal} = {} & \textbf{let } S_0 = [\![Data]\!]_{AVal} \times Svc \\
& \textbf{in let } S = [\![Data]\!]_{AVal} \times Svc \times [\![Msg]\!]_{AVal} \\
& \textbf{in } \textstyle\bigcup_{k \in Key} \overline{sign}(S_0 \cup S, k),
\end{aligned}
\tag{5.4}
\]

where $\overline{sign}$ is the pointwise extension of $sign$ with respect to its first argument. To construct
messages of the forms (5.1) and (5.2), we introduce constants $msg_0 \in Con$ and $msg \in Con$,
respectively, with interpretations

\[
\begin{aligned}
p_a(msg_0) &= (\lambda (k, data, dest) : Key \times [\![Data]\!]_{AVal} \times Svc.\ sign((data, dest), k)) \\
p_a(msg) &= (\lambda (k, data, dest, m) : Key \times [\![Data]\!]_{AVal} \times Svc \times [\![Msg]\!]_{AVal}.\ sign((data, dest, m), k))
\end{aligned}
\]

A set $KC \subseteq Con$ of key constants is used to represent the keys in $Key$. We adopt the
following convention: for $x \in Name$, $K_x \in KC$ represents $x$'s private key. Since these are
the only keys we are interested in, we take $KC = \bigcup_{x \in Name} \{K_x\}$. We require that private keys
be distinct. Thus, we assume the partial interpretation $p_a$ of constants satisfies

\[
\begin{aligned}
& \land\ (\forall kc \in KC : p_a(kc) \in Key) \\
& \land\ (\forall kc_1 \in KC : (\forall kc_2 \in KC : kc_1 \neq kc_2 \Rightarrow p_a(kc_1) \neq p_a(kc_2)))
\end{aligned}
\tag{5.5}
\]

Define $prin \in KC \to Name$ (short for “principal”) by $prin(K_x) = x$.

The processing done by a service $svc \in Svc$ is represented by an operator $svc \in Con$.
For example, if a moving agent carrying symbolic data value $v$ visits service $F$, the moving
agent will leave carrying symbolic data value $F(v)$.

The  behavior  of  this  protocol  for  the  two-stage  moving  agent  considered  above  is  shown 
in  Figure  5.2,  where 


\[
\begin{aligned}
m^0 &= msg_0(K_S, X, F) && (5.6) \\
m^1_i &= msg(K_{F_i}, F(X), G, m^0) && (5.7) \\
m^2_i &= msg(K_{G_i}, G(F(X)), B, m^1_i) && (5.8)
\end{aligned}
\]

where $X \in Var(S)$ represents the data sent by the source $S$. To see that this protocol
prevents spoofing by faulty replicas of service $H$, consider, for example, the case where
those faulty components obtain a copy of message $m^0$ by eavesdropping, and then send
the consolidator messages containing $m^0$. The consolidator will find these messages invalid,
because they are not signed by providers of the declared destination of $m^0$ (namely, service
$F$). Of course, attempts by the faulty replicas of service $H$ to change the declared destination
of $m^0$ will cause the consolidator’s check of the source’s signature to fail.
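
The consolidator's consistency check between declared destinations and signers can be sketched as follows. This is a simplified illustration under stated assumptions: messages are plain tuples `(signer, data, dest, inner)`, the `signer` field is assumed unforgeable (signature verification itself is not modeled), and the tables `SOURCES` and `PROVIDES` and the function `is_valid` are hypothetical names.

```python
# Hypothetical component tables for the system of Figure 5.2.
SOURCES = {"S"}
PROVIDES = {
    "F1": "F", "F2": "F", "F3": "F",
    "G1": "G", "G2": "G", "G3": "G",
    "H1": "H", "H2": "H", "H3": "H",
    "B": "B",  # the consolidator's own unique consolidating service
}

def is_valid(message, consolidator="B"):
    """Check origin, final destination, and destination/signer consistency.

    A message is a tuple (signer, data, dest, inner), with inner = None for
    messages sent by a source. Signature checks are assumed to have passed.
    """
    signer, _data, dest, inner = message
    if dest != PROVIDES[consolidator]:
        return False  # the consolidator is not the declared destination
    while inner is not None:
        inner_signer, _inner_data, inner_dest, inner_inner = inner
        if PROVIDES.get(signer) != inner_dest:
            return False  # signer does not provide the declared destination
        signer, inner = inner_signer, inner_inner
    return signer in SOURCES  # the innermost message must come from a source

m0 = ("S", "X", "F", None)
m1 = ("F1", "F(X)", "G", m0)
m2 = ("G2", "G(F(X))", "B", m1)
spoofed = ("H1", "bad", "B", m0)  # H1 wraps an eavesdropped copy of m0
```

Here `spoofed` is rejected because `H1` does not provide service `F`, the declared destination of `m0`.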

5.1.1  Voting  After  Each  Stage 

Figure 5.2: Run of replicated two-stage moving agent, with authentication.

This protocol provides some fault-tolerance but does not satisfy the fault-tolerance requirement
stated in Section 5.1, namely, that a moving agent tolerate simultaneous Byzantine failure of a
minority of replicas of each service it uses. A protocol that uses only the sparse pattern of
communication  shown  in  Figure  5.1  or  Figure  5.2  cannot  satisfy  this  requirement,  because 
the  effects  of  failures  in  different  stages  are  cumulative.  For  example,  the  above  protocol 
does not tolerate simultaneous failure of $F_1$ and $G_2$, because then two of the consolidator’s
three inputs might be corrupted by the failures. Incidentally, the same argument applies to
the two-stage pipeline of Chapter 3.
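
The cumulative effect of failures in the sparse pattern can be checked with a small simulation. This sketch makes worst-case assumptions that are not in the text: every faulty replica emits the same value `"bad"`, and authentication is ignored so that only the communication pattern is modeled. The function names are hypothetical.

```python
from collections import Counter

def run_sparse(faulty):
    """Inputs seen by the consolidator when F_i feeds only G_i."""
    outputs = []
    for i in (1, 2, 3):
        f_out = "bad" if f"F{i}" in faulty else "F(X)"
        if f"G{i}" in faulty or f_out == "bad":
            outputs.append("bad")        # corrupted at either stage
        else:
            outputs.append("G(F(X))")
    return outputs

def majority(values):
    """Return the value held by a strict majority, or None."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else None

one_failure = majority(run_sparse({"F1"}))
two_failures = majority(run_sparse({"F1", "G2"}))
```

With `F1` and `G2` both faulty, only a minority of each service has failed, yet two of the three paths are corrupted and the vote yields the incorrect value.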

To  make  computation  more  robust,  each  server  can  send  its  outgoing  messages  to  all 
replicas  of  the  next  service,  instead  of  just  one,  and  validity  tests  and  voting  are  incorporated 
into each stage of the computation after the first.⁴ The validity test and voting are just as
described  for  consolidators. 

This  change  to  the  protocol  requires  one  clarification.  Recall  that  a  server  included  in 
each  outgoing  message  the  unique  incoming  message  that  caused  it  to  send  that  message. 
Now,  a  server  sends  messages  only  after  receiving  valid  messages  with  the  same  value  from 
a  majority  of  the  replicas  of  some  service;  so,  we  add  that  the  server  may  include  in  the 
outgoing messages any one of these incoming messages.⁵

⁴ Intermediate levels of fault-tolerance can be achieved by voting after every few stages, rather than after
every stage.

⁵ The reader who wonders whether more than one incoming message should be included in the outgoing
message is referred to the comments below.

Note that the only remaining differences between a server and a consolidator are that
a consolidator does not perform application-specific computation (i.e., does not apply an
operator) and does not include authentication information in its outputs (it sends unadorned
data values to the actuator).
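
Voting after each stage can be sketched in the same simplified style. The assumptions are again illustrative only: faulty replicas all emit `"bad"`, authentication and the unused service $H$ are not modeled (so the flaw discussed later is not visible here), and the helper names are hypothetical. The quorum size is $\lceil (n+1)/2 \rceil$ for $n$ replicas.

```python
from collections import Counter

def majority(values, n=3):
    """Value reported by at least ceil((n + 1) / 2) of n replicas, else None."""
    quorum = (n + 2) // 2  # integer form of ceil((n + 1) / 2)
    present = [v for v in values if v is not None]
    if not present:
        return None
    value, count = Counter(present).most_common(1)[0]
    return value if count >= quorum else None

def run_dense(faulty):
    """Every F_i sends to every G_j; each G_j votes before applying G."""
    f_outs = ["bad" if f"F{i}" in faulty else "F(X)" for i in (1, 2, 3)]
    g_outs = []
    for j in (1, 2, 3):
        if f"G{j}" in faulty:
            g_outs.append("bad")
            continue
        voted = majority(f_outs)
        g_outs.append(f"G({voted})" if voted is not None else None)
    return majority(g_outs)  # the consolidator's vote

tolerated = run_dense({"F1", "G2"})
not_tolerated = run_dense({"F1", "F2"})
```

The first run shows that a minority failure in each stage is now masked; the second shows that a majority failure within one stage still is not, as expected.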

The resulting pattern of communication is shown in Figure 5.3. Each $G_j$ might include
any one of its three input messages in its output, so the value it outputs is $ms_j \times \{Msg\}$,
where

\[ ms_j = \{m^2_{1,j},\ m^2_{2,j},\ m^2_{3,j}\} \tag{5.9} \]

where $m^2_{i,j}$ is a message signed by $F_i$ then $G_j$:

\[ m^2_{i,j} = msg(K_{G_j}, G(F(X)), B, m^1_i). \tag{5.10} \]


Figure 5.3: Run of replicated two-stage moving agent, with authentication and with voting
after each stage. Each skewed ms-atom labels each of the three edges it crosses.


Comments  on  the  protocol.  Before  proceeding  with  the  modeling  and  analysis  of  the 
protocol  sketched  above,  we  observe  that  the  protocol  does  not  satisfy  the  fault-tolerance 
requirement stated in Section 5.1. Modifying the protocol to satisfy this fault-tolerance requirement is
straightforward (see Section 5.3.3). We choose to analyze this protocol, rather than a more
robust one, for two reasons. First, it seems appropriate to “test” the analysis on an incorrect
protocol. The analysis of this protocol then demonstrates both positive and negative
results;  specifically,  the  analysis  shows  that  the  protocol  enables  a  moving  agent  to  tolerate 
Byzantine  failure  of  a  minority  of  the  replicas  of  each  service  used  by  the  moving  agent  or 
Byzantine  failure  of  all  replicas  of  each  service  not  used  by  the  moving  agent,  but  not  both 
simultaneously.  Second,  this  development  reflects  our  original  expectations:  the  protocol 
was  developed  and  analyzed  with  the  idea  that  it  would  satisfy  the  original  fault-tolerance 
requirement,  but  the  analysis  proved  our  expectations  incorrect. 


5.1.2  The  Effects  of  Byzantine  Failures 

In the analysis of Byzantine agreement in Section 4.2, we assumed a Byzantine-faulty component
could output any value. This behavior was modeled using the abstract value Ty.
This model of Byzantine-faulty components is too coarse when cryptography-based protocols
are being analyzed. A fundamental assumption of cryptography is that it is infeasible for
malicious entities to randomly guess certain kinds of information, such as cryptographic keys
or signed messages. Of course, Ty represents all concrete values, including keys and signed
messages. To reflect the infeasibility of guessing such values,
we need to use in the outputs of Byzantine-faulty components abstract values that represent
all concrete values that can be generated from specified sets of cryptographic information.
The cryptographic information known by a faulty component is specified by a set of keys,
represented by key constants, and a set of signed messages, represented by elements of $SMsg$,
which is the smallest set satisfying


\[ SMsg = SMsg_0 \cup \bigcup_{kc \in KC,\ data \in SVal_0,\ dest \in Svc,\ m \in SMsg} \{msg(kc, data, dest, m)\}, \tag{5.11} \]

where

\[ SMsg_0 = \bigcup_{kc \in KC,\ data \in SVal_0,\ dest \in Svc} \{msg_0(kc, data, dest)\}. \tag{5.12} \]

For $kcs \subseteq KC$ and $ms \subseteq SMsg$, the abstract value $Arb(kcs, ms)$ represents all concrete
values that can be generated from the specified cryptographic information. To formalize
this, we first consider the meanings of elements of $KC$ and $SMsg$.

For elements of $KC$, we assume the meanings are given by a partial interpretation of
constants satisfying (5.5). When giving the meaning of $Arb(kcs, ms)$, we must consider
not only keys (i.e., elements of $Key$) that are represented by key constants but also keys
that are not represented by any key constant. To show soundness of a protocol, it suffices to
assume that Byzantine-faulty components cannot randomly guess cryptographic information
generated  from  keys  being  used  by  non-faulty  components.  Of  course,  those  keys  are  all 
represented  by  key  constants.  Thus,  it  does  not  matter  for  our  purposes  whether  keys 
not  represented  by  key  constants  can  be  guessed  by  faulty  components.  For  generality,  we 
assume  they  can  be.  Accordingly,  we  take  Arb(kcs,  ms)  to  represent  concrete  values  that 
can  be  generated  using  keys  explicitly  represented  in  kcs  and  keys  not  represented  by  any 
key  constant. 
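
The idea of "values that can be generated from specified cryptographic information" can be sketched as a membership test. This is only an illustration under several assumptions not taken from the text: signed values are tuples `("sign", body, key)`, message bodies are tuples, every non-key atom (names, data, services) is freely available to a faulty component, and keys not represented by any key constant are ignored. The function name `can_generate` is hypothetical.

```python
def can_generate(value, keys, known):
    """Can a faulty component holding `keys` and `known` messages build `value`?

    value: a str atom, a tuple of values, or ("sign", body, key_name).
    keys:  set of key names the component may sign with.
    known: set of signed messages it has obtained (e.g. by eavesdropping).
    """
    if value in known:
        return True  # previously obtained messages can be replayed as-is
    if isinstance(value, tuple) and len(value) == 3 and value[0] == "sign":
        _tag, body, key = value
        # Signing requires the key; the body must itself be constructible.
        return key in keys and can_generate(body, keys, known)
    if isinstance(value, tuple):
        return all(can_generate(part, keys, known) for part in value)
    return isinstance(value, str)  # atoms: names, data values, services

m0 = ("sign", ("X", "F"), "K_S")                   # eavesdropped source message
wrapped = ("sign", ("bad", "B", m0), "K_H1")       # H1 wraps m0 under its own key
forged = ("sign", ("bad", "F"), "K_S")             # needs the source's key

h1_keys, h1_known = {"K_H1"}, {m0}
```

Under these assumptions a faulty `H1` can produce `wrapped` but not `forged`, which reflects why a coarse "any value" model would overstate what faulty components can do.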

For elements of $SMsg$, the only subtlety involves the symbolic value $data$, which may
contain variables. In Chapter 3, the meaning of an abstract value is independent of the
interpretation of variables. This restriction forces an overly coarse approximation here, since
it implies (for example) that $Arb(kcs, \{msg_0(K_S, X, F)\})$ must represent messages from source
S  to  service  F  containing  an  arbitrary  data  value.  Thus,  in  this  approximation,  if  Arb  is  used 
to  represent  the  outputs  of  faulty  components,  those  components  would  appear  to  be  able 
to  change  the  data  in  a  message  without  invalidating  the  signature.  To  avoid  this  problem, 
we  allow  the  meaning  of  abstract  values  to  depend  on  the  interpretation  of  variables. 

This requires only minor changes to the framework described in Section 3.1. We parameterize
$[\![\cdot]\!]_{AVal}$ by a partial interpretation of symbols; thus, instead of $[\![\cdot]\!]_{AVal} \in Interp_{Set}(AVal)$,
we have $[\![\cdot]\!]_{AVal} \in interp(Sym) \to Interp_{Set}(AVal)$, where the ordering on $Interp_{Set}(AVal)$ is

\[ p_1 \sqsubseteq_{Interp_{Set}(S)} p_2 \;=\; (\forall s \in S : p_1(s) \neq \emptyset \Rightarrow p_1(s) = p_2(s)). \tag{5.13} \]

Since $p$ is a partial interpretation, it might not give values for all of the symbols on which
the meaning of an abstract value $a$ depends. For technical convenience, instead of using a
partial function, we encode this undefinedness by taking $[\![a]\!]^p_{AVal} = \emptyset$ in those cases. Note
that with this correspondence in mind (namely, $p_1(s) = \emptyset$ corresponds to $s \notin dom(p)$), (5.13)
is essentially the same as (2.60). The monotonicity and continuity requirements for $[\![\cdot]\!]_{AVal}$
ensure that the meaning of an abstract value $a$, once “defined” (i.e., non-empty), is not
changed by extending $p$.

To check that these changes are reasonable, we also give the revised definition of the
meanings of posets of ms-atoms, etc. The definitions in Section 2.4.1 are mostly unchanged,
the sole exception being the definition of $compatVal$ in (2.61), to which we add a $p$:

\[
\begin{aligned}
compatVal^p(val, cv) = (\exists \langle s, a \rangle \in val : {} & \land\ cv \in [\![a]\!]^p_{AVal} \\
& \land\ (s = \_ \lor cv = p(s))).
\end{aligned}
\tag{5.14}
\]

Now, note that for an abstract value $a$ with $[\![a]\!]^p_{AVal} = \emptyset$, the condition $cv \in [\![a]\!]^p_{AVal}$ does
not hold, so such abstract values are effectively ignored when determining the meaning of
a poset of ms-atoms, just as uninterpreted symbols are effectively ignored. Monotonicity of
$[\![\cdot]\!]^p_{AVal}$ with respect to $p$ ensures that $[\![\cdot]\!]_{Poset(L)}$ and $[\![\cdot]\!]_{Hist}$ are still monotonic with respect to
$p$. With this observation, the proof of soundness goes through just as before.

Returning to the meaning of messages and $Arb$, we parameterize both by a partial interpretation
$p \in interp(Sym)$ whose interpretation of constants is expected to satisfy (5.5),
(5.5),  and  (5.5).  We  define 

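\[
\begin{aligned}
[\![msg_0(kc, data, dest)]\!]^p_{SMsg} = {} & \textbf{if } p(data) \notin Data \textbf{ then } \emptyset \\
& \textbf{else } \{p(msg_0)(p(kc), p(data), dest)\}
\end{aligned}
\tag{5.15}
\]

\[
\begin{aligned}
[\![msg(kc, data, dest, m)]\!]^p_{SMsg} = {} & \textbf{if } p(data) \notin Data \textbf{ then } \emptyset \\
& \textbf{else } \textstyle\bigcup_{cv \in [\![m]\!]^p_{SMsg}} \{p(msg)(p(kc), p(data), dest, cv)\}
\end{aligned}
\tag{5.16}
\]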

where  p  is  given  by  (2.61). 

It is convenient to extend $[\![\cdot]\!]^p_{SMsg}$ to all symbolic values, by defining, for $s \in SVal \setminus SMsg$,

\[ [\![s]\!]^p_{SMsg} = [\![Msg]\!]_{AVal}. \tag{5.17} \]

This is a conservative approximation: symbolic values not in $SMsg$ are treated as
completely arbitrary messages. Note that we have elided the $p$ in $[\![Msg]\!]^p_{AVal}$, since the meaning
of $Msg$ is independent of $p$; we use this notation for other abstract values as well.

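Finally, for $kcs \subseteq KC$ and $ms \subseteq SVal$, the meaning of $Arb(kcs, ms)$ is the least set
satisfying

\[
\begin{aligned}
[\![Arb(kcs, ms)]\!]^p_{AVal} = {} & \textbf{let } keys = (\textstyle\bigcup_{kc \in kcs} \{p(kc)\}) \cup (Key \setminus \textstyle\bigcup_{kc \in KC} \{p(kc)\}) \\
& \textbf{in } Name \cup [\![Data]\!]_{AVal} \cup Svc \cup keys \cup (\textstyle\bigcup_{m \in ms} [\![m]\!]^p_{SMsg}) \\
& \quad \cup\ (\textstyle\bigcup_{k \in keys,\ x \in [\![Arb(kcs, ms)]\!]^p_{AVal}} \{sign(x, k)\}).
\end{aligned}
\tag{5.18}
\]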

Thus,  the  abstract  values  used  in  this  analysis  are 

\[ AVal = \{Data, Msg\} \cup Arbs \cup AMul \tag{5.19} \]

where

\[ Arbs = \bigcup_{kcs \subseteq KC,\ ms \subseteq SVal} \{Arb(kcs, ms)\} \tag{5.20} \]

\[ AMul = \{?, 1, *\} \tag{5.21} \]

5.2  Input-Output Functions

5.2.1  Input-Output  Function  for  Servers 

The input-output function $server(kc, svc, n, next, nbrs)$ represents a server with a private
key represented by $kc \in KC$ and that provides service $svc \in Svc$. Parameter $n \in \mathbb{N}$ is
the (minimum) number of replicas of each service in the system: the server looks for a
valid set of inputs of size $\lceil (n+1)/2 \rceil$ before processing a moving agent. To partially reflect
the dependence of the remaining path of a moving agent on information stored by servers,
the description of a server also specifies the next service $next \in Svc$ normally visited by
moving agents that visit that server.⁶ To further reflect this dependence and the possibility
that messages from faulty components might “confuse” a server, if there is no (symbolic)
majority among the input values corresponding to a moving agent, then we assume that the
next destination for this moving agent might be any component in $nbrs \subseteq Name$. If the
server is faulty, it sends arbitrary messages to and eavesdrops on the components in $nbrs$.

Inputs  corresponding  to  moving  agents  initiated  by  different  sources  can  be  processed 
separately,  so  we  write  the  input-output  function  for  a  server  as 

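\[
\begin{aligned}
& server(kc, svc, n, next, nbrs) = \\
& \quad (\lambda\, fail : \{OK, ByzFail\}.\ (\lambda\, h : Hist.\ (\lambda\, x : Name. \\
& \qquad \textbf{let } lbls = \textstyle\bigcup_{y \in Name} \pi_1(h(y)) \\
& \qquad \textbf{in let } evsdrprs = \{x \in Name \mid evsdrp \in \ldots (h(x)) \ldots\} \\
& \qquad \textbf{in if } fail = OK \textbf{ then } (\textstyle\bigcup_{src \in Src} server_1(kc, svc, n, next, src, lbls, x, evsdrprs),\ \emptyset) \\
& \qquad \textbf{else if } x \in (evsdrprs \setminus nbrs) \textbf{ then } (\{(*, mkArb(\{kc\}, \bar{\pi}_2(lbls)), \emptyset)\},\ \emptyset) \\
& \qquad \textbf{else if } x \in nbrs \textbf{ then } (\{(*, \{evsdrp, mkArb(\{kc\}, \bar{\pi}_2(lbls))\}, \emptyset)\},\ \emptyset) \\
& \qquad \textbf{else } (\emptyset, \emptyset))))
\end{aligned}
\tag{5.22}
\]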

where $server_1(kc, svc, n, next, src, lbls, x, evsdrprs) \in Set(L)$ is the set of output ms-atoms
to component $x \in Name$ that represent moving agents initiated by source $src \in Src$, where
$evsdrprs \subseteq Name$ is the set of eavesdroppers, $lbls \subseteq L$ is the set of input ms-atoms, and the
first four parameters are as for $server$. The definition of $server_1$ appears in Figure 5.4 and
is discussed in the following subsections.

⁶ It is straightforward to let the next service normally visited depend on information carried by the moving
agent, though abstracting from the identity of that service is difficult, as discussed in Section 5.4.

The function $mkArb \in Set(KC) \times Set(Val) \to Val$
returns an element of $Arbs$ that represents all concrete values that can be generated from
the concrete values represented by the specified key constants and values:

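\[
\begin{aligned}
mkArb(kcs, vs) = {} & \textbf{let } arbs = \textstyle\bigcup_{v \in vs} \bar{\pi}_2(v) \cap Arbs \\
& \textbf{in let } kcs_1 = kcs \cup (\textstyle\bigcup_{v \in vs} \bar{\pi}_1(v) \cap KC) \cup (\textstyle\bigcup_{Arb(kcs', ms') \in arbs} kcs') \\
& \textbf{in let } ms = (\textstyle\bigcup_{Arb(kcs', ms') \in arbs} ms') \cup (\textstyle\bigcup_{v \in vs} \bar{\pi}_1(v) \cap SMsg) \\
& \textbf{in } Arb(kcs_1, \overline{unpack}(ms)),
\end{aligned}
\tag{5.23}
\]

where $unpack \in SMsg \to Set(SMsg)$ is

\[
\begin{aligned}
unpack(m) = {} & \textbf{match } m \textbf{ with} \\
& \mid msg(\_, \_, \_, m') \to \{m\} \cup unpack(m') \\
& \mid \_ \to \{m\},
\end{aligned}
\tag{5.24}
\]

and where $\overline{unpack} \in Set(SMsg) \to Set(SMsg)$ is defined by

\[ \overline{unpack}(S) = \textstyle\bigcup_{m \in S} unpack(m). \tag{5.25} \]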

The  use  of  unpack  reflects  a  faulty  component’s  ability  to  extract  pieces  of  messages  and 
incorporate  them  in  its  outputs. 
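
The recursion in $unpack$ and its pointwise extension can be sketched directly. In this illustration, symbolic messages are encoded as tuples `("msg0", kc, data, dest)` and `("msg", kc, data, dest, inner)`; that encoding and the name `unpack_all` (standing for the pointwise extension) are assumptions made for the example.

```python
def unpack(m):
    """Return m together with every message nested inside it."""
    if m[0] == "msg":
        inner = m[4]
        return {m} | unpack(inner)
    return {m}  # a msg0 message has nothing nested inside

def unpack_all(ms):
    """Pointwise extension of unpack to a set of messages."""
    result = set()
    for m in ms:
        result |= unpack(m)
    return result

m0 = ("msg0", "K_S", "X", "F")
m1 = ("msg", "K_F1", "F(X)", "G", m0)
m2 = ("msg", "K_G2", "G(F(X))", "B", m1)
```

For instance, `unpack(m2)` yields all three messages, reflecting that a component holding `m2` also holds the embedded `m1` and `m0`.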

Determining  Set  of  Inputs  that  Contribute  to  the  Output 

In the definition of $server_1$, the value of $mmes$ (mnemonic for “(multiplicity, (message,
extension))’s”) summarizes the sets of input messages that can contribute to the output of
the server. Roughly, each element $(mul, mes)$ of $mmes$ corresponds to a possible quorum,
i.e., a valid set of input messages of appropriate size. The first component, $mul$, is 1 if the
inputs described by $mes$ definitely represent a quorum (hence definitely cause an output)
and is ? otherwise. Our definition of $mmes$ has the property that if $(1, mes) \in mmes$, then
also $(?, mes) \in mmes$; this is not necessary for correctness of the subsequent definitions, but
it does no harm.

The second component, $mes$, describes a set of messages that might have been received by the
server. If an input ms-atom contains $Arb(kcs, ms)$, a message in $ms$ may be extended with
signatures by keys in $kcs$. Such an extended message is represented by a pair $(m, ext)$, where
$m \in SMsg$ and where the extension $ext \in Seq(Name)$ is the additional sequence of sites by
which $m$ has been signed. As a special case, since a faulty source can generate valid messages
from scratch, we allow extended messages of the form $(\bot, ext)$ with $first(ext) \in Src$. It does
no harm to allow extended messages of the form $(\bot, ext)$ for arbitrary $ext \in Seq(Name)$,
since they will not satisfy the test for validity (see the definition of $valid_M$ below). Thus, we

$server_1(kc, svc, n, next, src, lbls, x, evsdrprs)$ =

let  11  1  “b  UTOaEge<MsgUW>(?j)  Um£ge<Msgs(ma)  {  |tTi  (getPath(m)  )  |  } ) 

in  let  mmes  =  {(mu/,  mes)  G  {?,  1}  x  Set(MsgExt(n ))  | 

(35  C  Ibis  :  (3h  G  mes  ^  5  : 

A  V  |mes|  =  \{n  +  l)/2]  A  getSigner(mes)  fl  Src  =  0 
V  |mes|  =  1  A  getSigner(mes)  fl  Src  ^  0 
A  (Vme  G  mes  :  (3ma  G  getMsgArb(jT2(h(me )))  : 
A7Ti(me)  G  getMsgs(ma) 

A  ^(me)  G  Seq(prin(getKeys(ma ))) 

A  ma  G  3lr6s  =>-  mul  =?)) 
a(WgS:^  7^(7n(£))  =x  |/dra”(^)|  =  1) 

A  {?,  *}  fl  ff(7fT(5))  7^  0  =b  mul  =?)) 

A  valid m (src ,  sue)(mes)} 

in  if  mmes  =  0  then  0 

else  let  ds  =  UmCse^(mma)  if  e  ^  vfj(me5)  then  {±} 

else  getData(Wi({(m ,  e#t)  G  mes  \  ext  =  c }) ) 
in  if  |ds  |  =  1  A  J_  ^  ds  then 

let  d  =  the  element  of  ds 

in  let  m,s?„  =  U mese^(mmes)  MEtoMsg(next ,  mes) 
in  let  ms  =  UTOe  ms,n{aPPly(ms9i  {(^<A  apply  (svc ,  ((d))),  ne:r/,  m)))} 
in  let  mul  =  if  1  G  Wi(mmes)  A  provides(x)  =  next  then  1  else  ? 
in  let  £  =  (mul,  ms  x  |Ms(/},0) 

in  if  provides(x)  =  next  V  x  G  evsdrprs  then  {6}  else  0 
else  (*  symbolic  output  value  not  unique;  approximate  *) 
if  x  G  nbrs  U  evsdrprs  then 

(?,  mkArb({kc} ,W2(lbls)) ,  tagOfSrc(src)) 

else  0 


Figure 5.4: Definition of $server_1$.

take $mes$ to be a subset of

\[ MsgExt = ((SMsg \cup \{\bot\}) \times Seq(Name)) \setminus \{(\bot, \varepsilon)\}. \tag{5.26} \]

As argued below, it suffices to consider only extensions of at most a certain length $n$, so in
Figure 5.4, we take $mes$ to be a subset of

\[ MsgExt(n) = ((SMsg \cup \{\bot\}) \times Seq(Name, n)) \setminus \{(\bot, \varepsilon)\}, \tag{5.27} \]

where $Seq(S, n) = \{\sigma \in Seq(S) \mid |\sigma| \le n\}$.

Inclusion of $(mul, mes)$ in $mmes$ is justified by the existence of a set $S \subseteq lbls$ and a
correspondence $h$ between $mes$ and $S$ that together satisfy the following five conditions,
corresponding to the five conjuncts in the definition of $mmes$.

First  Condition.  The  first  condition  is: 

\[
\begin{aligned}
& \lor\ |mes| = \lceil (n+1)/2 \rceil \land \overline{getSigner}(mes) \cap Src = \emptyset \\
& \lor\ |mes| = 1 \land \overline{getSigner}(mes) \cap Src \neq \emptyset.
\end{aligned}
\tag{5.28}
\]

This constrains the size of $mes$. There are two cases. A quorum comprising messages not
from sources must be of size $\lceil (n+1)/2 \rceil$; this corresponds to the first disjunct. Sources are
not replicated, so a single message signed by a source suffices. This corresponds to the second
disjunct; it is formalized using $getSigner \in MsgExt \to Name$, which returns the name of the
(last) signer of an extended message:

getSigner  ((m ,  ext))  =  if  ext  ^  e  then  last  (ext)  (5.29) 

else  match  rn  with 

\msg0(kc,  _,  _)  — >  prin(kc) 

\msg(kc ,  _,  _,  _)  — >  prin(kc) 

where  last  returns  the  last  element  of  a  sequence.  getSigner  is  the  pointwise  extension  of 
getSigner  to  sets  of  extended  messages. 
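The first condition and getSigner can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the prototype implementation: the classes `Msg0` and `Msg`, the use of `None` for the bottom element, and the rule that key constant `K_x` belongs to principal `x` are all hypothetical conventions chosen for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple, Union


@dataclass(frozen=True)
class Msg0:
    """msg0(kc, data, dest): a message created and signed by a source."""
    kc: str
    data: str
    dest: str


@dataclass(frozen=True)
class Msg:
    """msg(kc, data, dest, inner): a message signed by a server, wrapping its input."""
    kc: str
    data: str
    dest: str
    inner: Union[Msg0, "Msg"]


SMsg = Union[Msg0, Msg]
# An extended message (m, ext); None stands for the bottom element.
MsgExt = Tuple[Optional[SMsg], Tuple[str, ...]]


def prin(kc: str) -> str:
    """Assumed naming convention: key constant 'K_F2' belongs to principal 'F2'."""
    return kc[len("K_"):]


def get_signer(me: MsgExt) -> str:
    """Name of the last signer of an extended message, following (5.29)."""
    m, ext = me
    if ext:                      # ext is not the empty sequence
        return ext[-1]
    assert m is not None         # (bottom, empty) is excluded from MsgExt
    return prin(m.kc)


def first_condition(mes: FrozenSet[MsgExt], n: int, src: FrozenSet[str]) -> bool:
    """Size constraint (5.28) on a candidate set of extended messages."""
    signers = {get_signer(me) for me in mes}   # pointwise extension to sets
    quorum = (n + 2) // 2                      # equals ceil((n + 1) / 2) for integer n
    from_source = bool(signers & src)
    return (len(mes) == quorum and not from_source) or (len(mes) == 1 and from_source)
```

With three replicas per service, a quorum has size two, so a single source message or a pair of server messages passes the check, while a lone server message does not.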

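Second Condition. The second condition is:

\[
\begin{array}{l}
(\forall \mathit{me} \in \mathit{mes} : (\exists \mathit{ma} \in \mathit{getMsgArb}(\pi_2(h(\mathit{me}))) : \\
\qquad \land\ \pi_1(\mathit{me}) \in \mathit{getMsgs}(\mathit{ma}) \\
\qquad \land\ \pi_2(\mathit{me}) \in \mathit{Seq}(\mathit{prin}(\mathit{getKeys}(\mathit{ma}))) \\
\qquad \land\ \mathit{ma} \in \mathit{Arbs} \Rightarrow \mathit{mul} = {?})).
\end{array}
\tag{5.30}
\]

This requires that the correspondence $h$ be such that, for each extended message $(m, \mathit{ext}) \in \mathit{mes}$,

(i) The message $m$ appears in the value $\pi_2(h(\mathit{me}))$, i.e., $m$ appears in some $\mathit{ma} \in \mathit{MsgArb}$
that appears in $\pi_2(h(\mathit{me}))$.

(ii) The extension ext is a sequence of names by which $m$ can be extended. Specifically, if
$m$ appears in $\mathit{Arb}(\mathit{kcs}, \mathit{ms})$, then ext is a sequence of names in $\mathit{prin}(\mathit{kcs})$; if $m$ does not
appear in an element of Arbs, ext is the empty sequence.

(iii) If $m$ appears in $\pi_2(h(\mathit{me}))$ in an element of Arbs, then $\mathit{mul} = {?}$.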

To check (i), we start by using the function $\mathit{getMsgArb} \in \mathit{Val} \to \mathit{Set}(\mathit{MsgArb})$, which extracts
all elements of MsgArb that appear in a value, after applying the following conversion. If
a value contains the pair $s{:}\mathit{Msg}$ with $s \notin \mathit{SMsg}$, we have no information about the signer,
destination, etc., of the message it represents, so that pair is replaced with $\_{:}\mathit{Arb}(\mathit{KC}, \mathit{SMsg}_0)$,
which represents a completely arbitrary message. Let $\mathit{unknownToArb} \in \mathit{Val} \to \mathit{Val}$ denote
this conversion. Then

\[ \mathit{getMsgArb}(v) = (\pi_1(v) \cap \mathit{SMsg}) \cup (\pi_1(\mathit{unknownToArb}(v)) \cap \mathit{Arbs}). \tag{5.31} \]

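After choosing $\mathit{ma} \in \mathit{getMsgArb}(\pi_2(h(\mathit{me})))$, conditions (i) and (ii) are checked using the
functions $\mathit{getMsgs} \in \mathit{MsgArb} \to \mathit{Set}(\mathit{SMsg})$ and $\mathit{getKeys} \in \mathit{MsgArb} \to \mathit{Set}(\mathit{KC})$, which
extract the messages and key constants, respectively, that appear in an element of MsgArb:

\[
\begin{array}{l}
\mathit{getMsgs}(\mathit{ma}) = \textbf{match } \mathit{ma} \textbf{ with} \\
\qquad |\ \mathit{Arb}(\mathit{kcs}, \mathit{ms}) \to \mathit{ms} \\
\qquad |\ \_ \to \{\mathit{ma}\}
\end{array}
\tag{5.32}
\]

\[
\begin{array}{l}
\mathit{getKeys}(\mathit{ma}) = \textbf{match } \mathit{ma} \textbf{ with} \\
\qquad |\ \mathit{Arb}(\mathit{kcs}, \mathit{ms}) \to \mathit{kcs} \\
\qquad |\ \_ \to \emptyset
\end{array}
\tag{5.33}
\]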
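The two extraction functions, and the check of conjuncts (i) to (iii) for one chosen ma, can be sketched as follows. This is only an illustration: symbolic messages are represented as opaque strings, the class `Arb`, the helper `second_condition_for`, and the key-naming rule in `prin` are assumptions made for the example, and the existential choice of ma is left to the caller.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union


@dataclass(frozen=True)
class Arb:
    """Arb(kcs, ms): an arbitrary message built from keys kcs and messages ms."""
    kcs: FrozenSet[str]
    ms: FrozenSet[str]


# For brevity, symbolic messages are opaque strings in this sketch.
MsgArb = Union[str, Arb]


def get_msgs(ma: MsgArb) -> FrozenSet[str]:
    """(5.32): the messages that appear in an element of MsgArb."""
    return ma.ms if isinstance(ma, Arb) else frozenset({ma})


def get_keys(ma: MsgArb) -> FrozenSet[str]:
    """(5.33): the key constants that appear in an element of MsgArb."""
    return ma.kcs if isinstance(ma, Arb) else frozenset()


def prin(kc: str) -> str:
    """Assumed naming convention: key constant 'K_F1' belongs to principal 'F1'."""
    return kc[len("K_"):]


def second_condition_for(m: str, ext: Tuple[str, ...], ma: MsgArb, mul: str) -> bool:
    """The three conjuncts of the second condition for one chosen ma."""
    allowed = {prin(kc) for kc in get_keys(ma)}
    appears = m in get_msgs(ma)                                    # (i)
    extendable = all(name in allowed for name in ext)              # (ii)
    arb_forces_unknown = (not isinstance(ma, Arb)) or mul == "?"   # (iii)
    return appears and extendable and arb_forces_unknown
```

Because `get_keys` returns the empty set for a plain message, conjunct (ii) then admits only the empty extension, matching the prose above.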

Third Condition. The third condition is:

\[ (\forall b \in S : \pi_1(b) \neq * \Rightarrow |h^{\mathit{inv}}(b)| \le 1). \tag{5.34} \]

This says: if the multiplicity of an ms-atom isn’t $*$, then that ms-atom corresponds to at most
one element of mes.

Fourth Condition. The fourth condition is:

\[ \{?, *\} \cap \pi_1(S) \neq \emptyset \Rightarrow \mathit{mul} = {?}, \tag{5.35} \]

where $\pi_1$ is applied pointwise to $S$. This says: if the multiplicity of any ms-atom in $S$ is not
definite, then mul must be ?.

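Fifth Condition. The fifth condition requires that the set mes of extended messages
satisfy $\mathit{valid}_M(\mathit{src}, \mathit{svc})(\mathit{mes})$, where $\mathit{valid}_M(\mathit{src}, \mathit{svc}) \in \mathit{Set}(\mathit{MsgExt}) \to \mathbb{B}$. Function $\mathit{valid}_M$
uses $\mathit{getPath} \in \mathit{SMsg} \to ((\mathit{Seq}(\mathit{Name}) \times \mathit{Svc}) \cup \{\bot\})$ to extract the sequence of components
visited by each message and the service to which the message is destined; while extracting
this path, getPath checks that the signatures match the destinations, and returns $\bot$ if they
don’t match:

\[
\begin{array}{l}
\mathit{getPath}(m) = \textbf{match } m \textbf{ with} \\
\quad |\ \mathit{msg}_0(\mathit{kc}, \mathit{data}, \mathit{dest}) \to (\langle \mathit{prin}(\mathit{kc}) \rangle, \mathit{dest}) \\
\quad |\ \mathit{msg}(\mathit{kc}, \mathit{data}, \mathit{dest}, m') \to \textbf{match } \mathit{getPath}(m') \textbf{ with} \\
\qquad\quad |\ \bot \to \bot \\
\qquad\quad |\ (\sigma, \mathit{svc}) \to \textbf{if } \mathit{svc} = \mathit{provides}(\mathit{prin}(\mathit{kc})) \textbf{ then } (\sigma \cdot \langle \mathit{prin}(\mathit{kc}) \rangle, \mathit{dest}) \\
\qquad\qquad\qquad\qquad\ \textbf{else } \bot.
\end{array}
\tag{5.36}
\]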

$\mathit{valid}_M(\mathit{src}, \mathit{svc})$ checks that the extended messages all (i) started from src, (ii) are destined
for svc, (iii) visited the same sequence of services, and (iv) are signed by different replicas
of the last visited service. For an extended message with a non-empty extension, we drop
condition (ii), corresponding to the conservative approximation that the last destination
contained in the extension may be arbitrary, even if the extension ends with the name
of a non-faulty component. In other words, as mentioned above, although a non-faulty
server ordinarily sends moving agents to next, if the server receives arbitrary inputs, we
allow the possibility that it gets “confused” and sends the moving agent to an arbitrary
service. Instead, for an extended message $(m, \mathit{ext})$ with non-empty extension, condition (ii)
is replaced with the requirement that $m$ is destined for the service provided by the first
element of ext. Thus,

\[
\begin{array}{l}
\mathit{valid}_M(\mathit{src}, \mathit{svc})(S) = \\
\quad \textbf{if } (\exists (m, \mathit{ext}) \in S : m \neq \bot \land \mathit{getPath}(m) = \bot) \textbf{ then } \mathit{false} \\
\quad \textbf{else let } \mathit{path} = (\lambda (m, \mathit{ext}) : S.\ \textbf{if } m = \bot \textbf{ then } \mathit{ext} \textbf{ else } \pi_1(\mathit{getPath}(m)) \cdot \mathit{ext}) \\
\quad \textbf{in}\ \land\ (\forall x \in S : \mathit{first}(\mathit{path}(x)) = \mathit{src}) \\
\quad \phantom{\textbf{in}}\ \land\ (\forall (m, \mathit{ext}) \in S : \\
\qquad\qquad \land\ \mathit{ext} = \varepsilon \Rightarrow \pi_2(\mathit{getPath}(m)) = \mathit{svc} \\
\qquad\qquad \land\ \mathit{ext} \neq \varepsilon \Rightarrow (m = \bot \lor \pi_2(\mathit{getPath}(m)) = \mathit{provides}(\mathit{first}(\mathit{ext})))) \\
\quad \phantom{\textbf{in}}\ \land\ (\forall x_1, x_2 \in S : x_1 \neq x_2 \Rightarrow \\
\qquad\qquad \land\ \mathit{provides}(\mathit{rest}(\mathit{path}(x_1))) \in \mathit{Seq}(\mathit{Svc}) \\
\qquad\qquad \land\ \mathit{provides}(\mathit{rest}(\mathit{path}(x_1))) = \mathit{provides}(\mathit{rest}(\mathit{path}(x_2))) \\
\qquad\qquad \land\ |\mathit{path}(x_1)| > 1 \Rightarrow \mathit{last}(\mathit{path}(x_1)) \neq \mathit{last}(\mathit{path}(x_2))),
\end{array}
\tag{5.37}
\]

where provides, when applied to a sequence of names, denotes its pointwise extension to sequences,
rest returns the “rest” of a sequence (i.e., the sequence with its first element, if any, removed), and
first and last return the first and last element of a sequence, respectively. It doesn’t matter here what
first and last return on the empty sequence.
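The path extraction and validity check can be sketched in Python as below. This is one reading of (5.36) and (5.37), not the prototype's code: the message classes, the `PROVIDES` table (three replicas each of services F, G, H, plus B), the use of `None` for the bottom element, and the key-naming rule in `prin` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple, Union


@dataclass(frozen=True)
class Msg0:
    kc: str
    data: str
    dest: str


@dataclass(frozen=True)
class Msg:
    kc: str
    data: str
    dest: str
    inner: Union[Msg0, "Msg"]


SMsg = Union[Msg0, Msg]
MsgExt = Tuple[Optional[SMsg], Tuple[str, ...]]   # None stands for bottom
Path = Tuple[Tuple[str, ...], str]                # (names visited, destination)

# Hypothetical configuration mirroring the running example.
PROVIDES: Dict[str, str] = {f"{s}{i}": s for s in "FGH" for i in (1, 2, 3)}
PROVIDES["B"] = "B"


def prin(kc: str) -> str:
    """Assumed naming convention: key constant 'K_F2' belongs to principal 'F2'."""
    return kc[len("K_"):]


def get_path(m: SMsg) -> Optional[Path]:
    """(5.36): names visited and final destination, or None if signatures mismatch."""
    if isinstance(m, Msg0):
        return ((prin(m.kc),), m.dest)
    inner = get_path(m.inner)
    if inner is None:
        return None
    sigma, svc = inner
    signer = prin(m.kc)
    if PROVIDES.get(signer) != svc:   # signer must provide the inner destination
        return None
    return (sigma + (signer,), m.dest)


def valid_m(src: str, svc: str, S: FrozenSet[MsgExt]) -> bool:
    """(5.37): assumes (None, ()) never occurs in S, as in the definition of MsgExt."""
    if any(m is not None and get_path(m) is None for m, _ in S):
        return False

    def path(me: MsgExt) -> Tuple[str, ...]:
        m, ext = me
        return ext if m is None else get_path(m)[0] + ext

    def services(me: MsgExt) -> List[Optional[str]]:
        return [PROVIDES.get(name) for name in path(me)[1:]]

    started_at_src = all(path(x)[0] == src for x in S)
    destined_ok = all(
        get_path(m)[1] == svc if not ext
        else (m is None or get_path(m)[1] == PROVIDES.get(ext[0]))
        for m, ext in S
    )
    pairs_ok = all(
        None not in services(x1)
        and services(x1) == services(x2)
        and (len(path(x1)) <= 1 or path(x1)[-1] != path(x2)[-1])
        for x1 in S for x2 in S if x1 != x2
    )
    return started_at_src and destined_ok and pairs_ok
```

For instance, a genuine message from F2 together with the source message extended by the faulty F1 forms a valid quorum for G, whereas two messages whose last signer is the same replica do not.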

Determining  the  Server’s  Outputs 

The result of server$_1$ is determined from mmes as follows. If mmes is empty, then moving
agents initiated by this source cause no outputs from this server. If mmes is non-empty, then
we gather a set ds of symbolic values corresponding to possible results of the majority vote on
the data values in the inputs from the source. Recall that a server produces outputs only if it
receives the same value from a majority of replicas of a service (one can think of a source as
a service with one replica, so one value already forms a majority); thus, $\mathit{mes} \in \mathit{Set}(\mathit{MsgExt})$
causes an output only if the data values in the concrete messages represented by the extended
messages in mes are equal, in which case we can obtain a symbolic value representing that
common data value by using getData to extract the symbolic data value from any extended
message in mes whose extension is empty. The function $\mathit{getData} \in \mathit{SMsg} \to \mathit{SVal}_0$ extracts
the symbolic data from a message:

\[
\begin{array}{l}
\mathit{getData}(s) = \textbf{match } s \textbf{ with} \\
\qquad |\ \mathit{msg}_0(\_, d, \_) \to d \\
\qquad |\ \mathit{msg}(\_, d, \_, \_) \to d
\end{array}
\tag{5.38}
\]

On the other hand, if all extended messages in mes have non-empty extensions, then the
common data value (if any) could be arbitrary, so $\bot$ is added to ds.

If ds contains a single symbolic value $d$, then $d$ represents the result of the majority
vote. The output message is signed by kc, contains as data the symbolic value $\mathit{svc}(d)$, and is
destined for the service next. The set of possibilities for the input message that is included in
the output message is computed by transforming each extended message in $\bigcup_{\mathit{mes} \in \pi_2(\mathit{mmes})} \mathit{mes}$
into an element of SMsg using the function $\mathit{MEtoMsg} \in \mathit{Svc} \times \mathit{MsgExt} \to \mathit{SMsg}$, defined by

\[
\begin{array}{l}
\mathit{MEtoMsg}(\mathit{next}, (m, \mathit{ext})) = \textbf{if } \mathit{ext} = \varepsilon \textbf{ then } m \\
\quad \textbf{else let } \mathit{svc} = \mathit{provides}(\mathit{last}(\mathit{ext})) \\
\quad \textbf{in let } m_1 = \mathit{MEtoMsg}(\mathit{svc}, (m, \mathit{allbutlast}(\mathit{ext}))) \\
\quad \textbf{in let } d = \mathit{apply}(\mathit{svc}, \langle \mathit{getData}(m_1) \rangle) \\
\quad \textbf{in } \mathit{apply}(\mathit{msg}, \langle K_{\mathit{last}(\mathit{ext})}, d, \mathit{next}, m_1 \rangle),
\end{array}
\tag{5.39}
\]

where for a sequence $s$, $\mathit{allbutlast}(s)$ is $s$ with the last element removed. Determining the
multiplicity of the outgoing messages is straightforward. Note that, since we are assuming
that each source initiates at most one moving agent, each server sends at most one “burst”
of messages (i.e., a message to each provider of some service) corresponding to each source.
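The recursion in MEtoMsg can be sketched as follows. This is an illustrative sketch only: symbolic data values are plain strings, symbolic application is rendered as string formatting, and the classes, the `PROVIDES` table, and the `K_x` key names are assumptions made for the example. The case of a bottom message with a non-empty extension is not handled here.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Msg0:
    kc: str
    data: str
    dest: str


@dataclass(frozen=True)
class Msg:
    kc: str
    data: str
    dest: str
    inner: Union[Msg0, "Msg"]


SMsg = Union[Msg0, Msg]

# Hypothetical configuration mirroring the running example.
PROVIDES: Dict[str, str] = {f"{s}{i}": s for s in "FGH" for i in (1, 2, 3)}


def get_data(s: SMsg) -> str:
    """(5.38): both constructors carry the symbolic data in their second field."""
    return s.data


def me_to_msg(next_svc: str, me: Tuple[SMsg, Tuple[str, ...]]) -> SMsg:
    """(5.39): replay the signers named in the extension, innermost first."""
    m, ext = me
    if not ext:
        return m
    last = ext[-1]
    svc = PROVIDES[last]
    m1 = me_to_msg(svc, (m, ext[:-1]))        # message as received by `last`
    d = f"{svc}({get_data(m1)})"              # symbolic application svc(d)
    return Msg(f"K_{last}", d, next_svc, m1)  # signed with the key of `last`
```

For example, extending the source message by F1 and then G2 yields a message signed by G2 whose data is `G(F(X))` and whose included input is the message F1 would have sent to service G.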

If ds contains $\bot$ or contains multiple symbolic values, then the result of the
majority vote is not known exactly, so we adopt a coarse approximation, representing the
output by the abstract value $\mathit{mkArb}(\{\mathit{kc}\}, \pi_2(\mathit{lbls}))$. The function $\mathit{tagOfSrc} \in \mathit{Src} \to \mathit{Tag}$ returns a
different tag for each source; this ensures that the ms-atoms produced by different calls to server$_1$
from a single call to server are distinct. Note that it is safe to use a tag of 0 in case $d$ is
uniquely determined, since then the input message msgin included in the output ms-atom
ensures uniqueness of the output ms-atom.

Finally, we argue that restricting to extensions of length at most $n$ does not affect the result of
server$_1$; thus, the restriction is needed not for soundness but for termination. Informally,
$n$ is 1 more than the length of the longest sequence of signatures on any message that the
server might have received. In the definition of $n$ in Figure 5.4, note that max returns the
maximal element of a set of natural numbers, and for convenience, we define $\pi_1(\bot) = \bot$
and $|\bot| = 0$. Suppose $(\mathit{mul}, \mathit{mes})$ would be added to mmes if $\mathit{MsgExt}(n)$ were replaced with
MsgExt in the definition of mmes. We argue that this would not affect the result of server$_1$.
Let $S \subseteq \mathit{lbls}$ be some set of ms-atoms that justifies the inclusion of $(\mathit{mul}, \mathit{mes})$ in mmes.
Since $(\mathit{mul}, \mathit{mes})$ was not already in mmes, mes must contain an extension of length greater
than $n$. Let $i$ be the length of the shortest extension in mes. Since $\mathit{valid}_M(\mathit{src}, \mathit{svc})(\mathit{mes})$
holds, all elements of mes have the same “path length”, i.e., for all $\mathit{me} \in \mathit{mes}$, $|\mathit{path}(\mathit{me})|$ is
the same. Since a message contributes at most $n - 1$ to the length of $\mathit{path}(\mathit{me})$, this common
“path length” is at most $(n - 1) + i$. It is also greater than $n$, because some extension in mes is
longer than $n$; hence $(n - 1) + i \ge n + 1$, i.e., $i \ge 2$.

Let $\mathit{mes}'$ be mes with the last $i - 1$ elements of each extension removed. We claim that
$(\mathit{mul}, \mathit{mes}')$ is in mmes. It is straightforward to show that $\mathit{valid}_M(\mathit{src}, \mathit{svc})(\mathit{mes}')$ holds, and
that the same set $S$ of ms-atoms can be used to justify inclusion of $(\mathit{mul}, \mathit{mes}')$ in mmes,
provided the extensions in $\mathit{mes}'$ are of length at most $n$. The common “path length” for $\mathit{mes}'$
is at most $(n - 1) + i - (i - 1)$, which simplifies to $n$, so the length of the longest extension
in $\mathit{mes}'$ is at most $n$. Thus, $(\mathit{mul}, \mathit{mes}')$ is in mmes.

Now consider the result of server$_1$. The length of the shortest extension in $\mathit{mes}'$ is
$i - (i - 1) = 1$, so every extension in $\mathit{mes}'$ is non-empty, so ds contains $\bot$. So, adding $(\mathit{mul}, \mathit{mes})$
to mmes only provides a redundant justification for the inclusion of $\bot$ in ds.

5.2.2  Other  Input-Output  Functions 

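Definition of source. For $\mathit{kc} \in \mathit{KC}$, $\mathit{data} \in \mathit{SVal}$, and $\mathit{dest} \in \mathit{Svc}$, the input-output
function $\mathit{source}(\mathit{kc}, \mathit{data}, \mathit{dest})$ represents a source with private key kc that initiates a moving
agent by sending a message containing symbolic data value data to the replicas of service
dest.

\[
\begin{array}{l}
\mathit{source}(\mathit{kc}, \mathit{data}, \mathit{dest}) = \\
\quad (\lambda \mathit{fail}{:}\{\mathit{OK}\}.\ (\lambda h{:}\mathit{Hist}.\ (\lambda x{:}\mathit{Name}. \\
\qquad \textbf{if } \mathit{provides}(x) = \mathit{dest} \textbf{ then } (\{(1,\ \mathit{msg}_0(\mathit{kc}, \mathit{data}, \mathit{dest}){:}\mathit{Msg},\ 0)\},\ \emptyset) \\
\qquad \textbf{else } (\emptyset, \emptyset)))).
\end{array}
\tag{5.40}
\]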
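The curried shape of (5.40) can be mirrored with nested closures. The sketch below is illustrative only: the `PROVIDES` table, the encoding of an ms-atom as a `(multiplicity, (symbolic, abstract), tag)` tuple, and the use of the string `"OK"` for the single failure status are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Msg0:
    kc: str
    data: str
    dest: str


# Hypothetical configuration mirroring the running example.
PROVIDES: Dict[str, str] = {f"{s}{i}": s for s in "FGH" for i in (1, 2, 3)}


def source(kc: str, data: str, dest: str):
    """Input-output function of a source, curried as fail -> history -> name -> output."""
    def on_fail(fail: str):
        assert fail == "OK"              # a source has only the non-faulty status
        def on_hist(h):                  # a source ignores its input history
            def on_name(x: str):
                if PROVIDES.get(x) == dest:
                    # one message, multiplicity 1, abstract value Msg, tag 0
                    atom = (1, (Msg0(kc, data, dest), "Msg"), 0)
                    return (frozenset({atom}), frozenset())
                return (frozenset(), frozenset())
            return on_name
        return on_hist
    return on_fail
```

Applying `source("K_S", "X", "F")("OK")` to any history and then to a replica of F yields the single initial message; every other neighbor receives the empty output.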

Definition of broker. The input-output function for a consolidator is almost the same as
that for a server, except that a consolidator does not apply an operator to the outgoing data
or sign its outgoing messages.

For $\mathit{svc} \in \mathit{Svc}$, $n \in \mathbb{N}$, $\mathit{next} \in \mathit{Name}$, and $\mathit{nbrs} \subseteq \mathit{Name}$, $\mathit{broker}(\mathit{svc}, n, \mathit{next}, \mathit{nbrs})$ is
given by the right side of (5.22) with $\mathit{server}_1(\mathit{kc}, \mathit{svc}, n, \mathit{next}, \mathit{src}, \mathit{lbls}, x, \mathit{evsdrprs})$ replaced
with $\mathit{consolidator}_1(\mathit{svc}, n, \mathit{next}, \mathit{src}, \mathit{lbls}, x, \mathit{evsdrprs})$, and with $\{\mathit{kc}\}$ replaced with $\emptyset$ in both
occurrences of mkArb.

In turn, $\mathit{consolidator}_1(n, \mathit{next}, \mathit{src}, \mathit{lbls}, x, \mathit{evsdrprs})$ is given by the right side of the equation
in Figure 5.4 with:

• $(\mathit{mul}, \mathit{ms} \times \{\mathit{Msg}\}, 0)$ replaced with $(\mathit{mul}, d{:}\mathit{Data}, 0)$,

• both occurrences of $\mathit{provides}(x) = \mathit{next}$ replaced with $x = \mathit{next}$ (since next here is the
name of the actuator, not a service it provides), and

• $\mathit{mkArb}(\ldots)$ replaced with Data (since we assume a “confused” consolidator still sends
only data values in Data to the actuator).

Note that the bindings of $S$, msgin, and msg are dead code in consolidator$_1$ and can be
eliminated.

5.3  Analysis  of  Perturbed  Behavior 

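For the system in Figure 5.3, nf is given by

\[
\begin{array}{l}
\mathit{nf}(S) = \mathit{source}(K_S, X, F) \\
\mathit{nf}(F_i) = \mathit{server}(K_{F_i}, F, 3, G, \mathit{nbrs} \setminus \{F_i\}) \\
\mathit{nf}(G_i) = \mathit{server}(K_{G_i}, G, 3, B, \mathit{nbrs} \setminus \{G_i\}) \\
\mathit{nf}(B) = \mathit{broker}(B, 3, A, \mathit{nbrs} \setminus \{B\}) \\
\mathit{nf}(A) = (\lambda \mathit{fail}{:}\{\mathit{OK}\}.\ (\lambda h{:}\mathit{Hist}.\ (\lambda x{:}\mathit{Name}.\ (\emptyset, \emptyset)))),
\end{array}
\]

where $\mathit{nbrs} = \{F_1, F_2, F_3, G_1, G_2, G_3, H_1, H_2, H_3, B\}$. For that system, provides is given by

\[
\begin{array}{l}
\mathit{provides}(F_i) = F \\
\mathit{provides}(G_i) = G \\
\mathit{provides}(B) = B.
\end{array}
\]

We take $\mathit{tagOfSrc}(S) = 0$.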

5.3.1  Failure  of  Visited  Services. 

Consider the failure scenario in which $F_1$ and $G_2$ fail. A straightforward calculation shows
that the fixed-point is the same as the run in Figure 5.3, except that the outputs of the faulty
components are different, and other components send messages to the faulty components as
a result of eavesdropping. Specifically, for $x \in \{F_1, G_2\}$ and $y \in \mathit{nbrs} \setminus \{x\}$, the edge $(x, y)$
is labeled with

\[ \{\mathit{evsdrp},\ \mathit{Arb}(\{K_{F_1}, K_{G_2}\}, \mathit{ms}_1)\}^{*}, \]

where  ms |  =  {m°,  m\,  m\,  Also,  for  x  G  {Fi,G2}  and  y  G  nbrs  \ 

$\{F_1, G_2\}$, the edge $(y, x)$ is labeled with all the output ms-atoms of component $y$ in Figure
5.3, but with the multiplicities changed to ?.

To help the reader verify this calculation, we give the values of mmes obtained in the
evaluation of server$_1$ and consolidator$_1$ when the fixed-point has been reached; values of mmes
in previous iterations of the fixed-point calculation are subsets of these. When evaluating
server$_1$ for $F_2$ and $F_3$, the value of mmes is

\[ \bigcup_{\mathit{mul} \in \{?, 1\}} \{(\mathit{mul}, \{(m^0, \varepsilon)\})\}. \]

Note that the inclusion of $(?, \{(m^0, \varepsilon)\})$ in mmes is justified by $S = \{(1, m^0, 0)\}$ and $S =
\{(*, \mathit{Arb}(\{K_{F_1}, K_{G_2}\}, \mathit{ms}_1), 0)\}$. The latter reflects the possibility that a faulty component
obtains $m^0$ by eavesdropping and forwards $m^0$ to $F_2$ or $F_3$. Since the network is asynchronous,
$F_2$ or $F_3$ might receive this forwarded copy before the original copy sent by the source.
When evaluating server$_1$ for $G_1$ and $G_3$, the value of mmes is

\[
\begin{array}{l}
\bigcup_{\mathit{mul} \in \{?, 1\}} \{(\mathit{mul}, \{(m^1_2, \varepsilon), (m^1_3, \varepsilon)\})\} \\
\quad \cup\ \{(?, \{(m^1_2, \varepsilon), (m^0, \langle F_1 \rangle)\}),\ (?, \{(m^1_3, \varepsilon), (m^0, \langle F_1 \rangle)\})\}.
\end{array}
\]

The presence of $\{(m^1_2, \varepsilon), (m^0, \langle F_1 \rangle)\}$ reflects the possibility that the faulty server $F_1$ sends
the correct value to $G_1$ or $G_3$. In particular, if the extended message $(m^0, \langle F_1 \rangle)$ contains the
same data value as $m^1_2$ (namely, the value represented by $F(X)$), then these two messages
can cause an output; this is why they are included in mmes and thereby used to justify the
inclusion of $F(X)$ in ds. If $(m^0, \langle F_1 \rangle)$ contains a different value than $m^1_2$, then these two
messages do not cause an output, so these two messages do not justify adding any other
data values to ds. The explanation for $\{(m^1_3, \varepsilon), (m^0, \langle F_1 \rangle)\}$ is analogous.

When evaluating consolidator$_1$ for $B$, the value of mmes is

\[
\begin{array}{l}
\bigcup_{\mathit{mul} \in \{?, 1\}} \bigcup_{i, i' \in \{2, 3\}} \bigcup_{j \in \{1, 3\}} \bigcup_{j' \in \{1, 3\} \setminus \{j\}} \{(\mathit{mul}, \{(m^2_{i,j}, \varepsilon), (m^2_{i',j'}, \varepsilon)\})\} \\
\quad \cup\ \bigcup_{i \in \{2, 3\},\, j \in \{1, 3\}} \{(?, \{(m^2_{i,j}, \varepsilon), (m^0, \langle F_1, G_2 \rangle)\})\} \\
\quad \cup\ \bigcup_{i \in \{2, 3\},\, j \in \{1, 3\}} \{(?, \{(m^2_{i,j}, \varepsilon), (m^1_i, \langle G_2 \rangle)\})\}.
\end{array}
\]

Extended messages with non-empty extensions are present here for analogous reasons.

Note that the voting in each step “heals” the effects of failure of a minority of the replicas
in the previous step. For example, failure of $F_1$ no longer perturbs the output of $G_1$, so the
system tolerates failure of $F_1$ and $G_2$.

5.3.2  Failure  of  Unvisited  Services. 

Consider the failure scenario in which $H_1$, $H_2$, and $H_3$ fail. A straightforward calculation
shows that the fixed-point is the same as the run in Figure 5.3, except that the outputs
of the faulty components are different, and other components send messages to the faulty
components as a result of eavesdropping. Specifically, for $x \in \{H_1, H_2, H_3\}$ and $y \in \mathit{nbrs} \setminus
\{x\}$, the edge $(x, y)$ is labeled with

\[ \{\mathit{evsdrp},\ \mathit{Arb}(\{K_{H_1}, K_{H_2}, K_{H_3}\}, \mathit{ms}_2)\}^{*}, \]

where

\[ \mathit{ms}_2 = \{m^0\} \cup \Bigl(\bigcup_{i \in \{1,2,3\}} \{m^1_i\}\Bigr) \cup \Bigl(\bigcup_{i,j \in \{1,2,3\}} \{m^2_{i,j}\}\Bigr). \]

Also, for $x \in \{H_1, H_2, H_3\}$ and $y \in \mathit{nbrs} \setminus \{H_1, H_2, H_3\}$, the edge $(y, x)$ is labeled with all
the output ms-atoms of component $y$ in Figure 5.3, but with the multiplicities changed to ?.

5.3.3  Failure  of  Visited  and  Unvisited  Services. 

Consider the failure scenario in which $F_1$, $H_1$, and $H_2$ fail. As the reader may have suspected,
the protocol described above does not tolerate this failure scenario. To help see why, we trace
the fixed-point calculation. Let $\mathit{faulty} = \{F_1, H_1, H_2\}$. Let $r_i$ denote the run obtained after
i  steps.  Thus,  ro  =  ARun.  Run  r\  is  the  same  as  r 0  except  as  follows: 


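$x$                  $y$                                $r_1(\langle x, y \rangle)$
-------------------  ---------------------------------  ------------------------------------------------
$\{S\}$              $\{F_1, F_2, F_3\}$                $m^0$
faulty               $\mathit{nbrs} \setminus \{x\}$    $\{\mathit{evsdrp},\ \mathit{Arb}(\{K_x\}, \emptyset)\}^{*}$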

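Run $r_2$ is the same as $r_1$ except as follows:

$x$                  $y$                                  $r_2(\langle x, y \rangle)$
-------------------  -----------------------------------  ------------------------------------------------
$\{F_2, F_3\}$       $\{G_1, G_2, G_3\}$                  $m^1_x$
$\{F_2, F_3\}$       faulty                               $(m^1_x)^{?}$
$\{F_1\}$            $\mathit{nbrs} \setminus \{F_1\}$    $\{\mathit{evsdrp},\ \mathit{Arb}(\{K_{F_1}, K_{H_1}, K_{H_2}\}, \{m^0\})\}^{*}$
$\{H_1, H_2\}$       $\mathit{nbrs} \setminus \{x\}$      $\{\mathit{evsdrp},\ \mathit{Arb}(\{K_{F_1}, K_{H_1}, K_{H_2}\}, \emptyset)\}^{*}$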

Consider the evaluation of server$_1$ for $F_2$ in computing run $r_3$. The value of mmes is
$\{(1, \{(m^0, \varepsilon)\}),\ (?, \{(m^0, \langle F_1, H_1 \rangle), (m^0, \langle F_1, H_2 \rangle)\})\}$. The latter element corresponds to
the possibility that $H_1$ and $H_2$ sent their private keys to $F_1$, which used its own key and
those keys together to fool $F_2$. Thus, ds is $\{X, \bot\}$, so the output of $F_2$, sent to $\mathit{nbrs} \setminus \{F_2\}$,
is an element of Arbs. The other non-faulty servers and the consolidator are also fooled. For
$x \in \{F_2, F_3, G_1, G_2, G_3\}$ and $y \in \mathit{nbrs} \setminus \{x\}$, $r_3(\langle x, y \rangle) = \mathit{Arb}(\{K_{F_1}, K_{H_1}, K_{H_2}, K_x\}, \{m^0\})^{*}$.
The consolidator sends $\mathit{Data}^{?}$ to the actuator (and to the eavesdroppers), so the fault-tolerance
requirement is already violated. Informally, the protocol breaks down because
each output message contains only one input message; a “quorum” of input messages should
be included in each output message.

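The fixed-point $r$ is given by:

$x$                               $y$                                $r(\langle x, y \rangle)$
--------------------------------  ---------------------------------  ------------------------------------------------
$\{S\}$                           $\{F_1, F_2, F_3\}$                $m^0$
$\{F_2, F_3, G_1, G_2, G_3\}$     $\mathit{nbrs} \setminus \{x\}$    $\mathit{Arb}(\mathit{kcs}, \{m^0\})^{?}$
faulty                            $\mathit{nbrs} \setminus \{x\}$    $\{\mathit{evsdrp},\ \mathit{Arb}(\mathit{kcs}, \{m^0\})\}^{*}$
$\{B\}$                           $\{A\} \cup \mathit{faulty}$       $\mathit{Data}^{?}$

where $\mathit{kcs} = \bigcup_{i \in \{1,2,3\}} \{K_{F_i}, K_{G_i}, K_{H_i}\}$, and where the other edges of $r$ are labeled with the
empty set.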
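The iteration $r_0, r_1, r_2, \ldots$ traced in this section follows the usual pattern of computing a least fixed point by repeated application of a monotone step function. The sketch below shows only that pattern on a toy three-hop chain; the names `least_fixed_point` and `toy_step`, the representation of a run as a dictionary from edges to label sets, and the toy forwarding rule are all hypothetical and are not the input-output functions defined in this chapter.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Edge = Tuple[str, str]
Run = Dict[Edge, FrozenSet[str]]   # edge -> set of labels; missing edge = empty set


def least_fixed_point(step: Callable[[Run], Run], bottom: Run,
                      max_steps: int = 100) -> List[Run]:
    """Iterate `step` from `bottom`; return the list r_0, r_1, ... up to the fixed point."""
    runs = [bottom]
    for _ in range(max_steps):
        nxt = step(runs[-1])
        if nxt == runs[-1]:        # r_{i+1} = r_i: fixed point reached
            return runs
        runs.append(nxt)
    raise RuntimeError("no fixed point reached within max_steps")


def toy_step(run: Run) -> Run:
    """Toy system S -> F -> G -> B: each component forwards once it has an input."""
    new: Run = {("S", "F"): frozenset({"m0"})}          # the source always sends m0
    if "m0" in run.get(("S", "F"), frozenset()):
        new[("F", "G")] = frozenset({"m1"})
    if "m1" in run.get(("F", "G"), frozenset()):
        new[("G", "B")] = frozenset({"m2"})
    return new
```

Starting from the empty run, each application of `toy_step` adds the outputs enabled by the inputs of the previous run, so the chain above stabilizes after three steps.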

5.4  Discussion 

5.4.1  Symbolic  vs.  Abstract  Values 

In  the  above  model,  we  introduced  constants  to  represent  names  of  services.  Since  symbolic 
values  are  intended  primarily  for  representing  relationships  between  values,  one  could  argue 
that  it  would  be  more  appropriate  to  use  abstract  values  to  represent  names  of  services. 
For  example,  we  could  distinguish  the  destination  service  of  a  message  in  the  abstract  value 
instead  of  the  symbolic  value,  just  as  the  sender  of  a  message  is  distinguished  in  the  abstract 
value  in  the  analysis  of  reliable  broadcast  in  Section  4.1.  We  use  constants  here,  instead  of 
abstract  values,  only  to  improve  the  appearance  of  the  formulas.  Symbolic  values  are  needed 
to  track  keys  and  data,  so  it  is  tidier  to  use  symbolic  values  for  all  parts  of  the  message, 
rather  than  splitting  the  information  between  the  symbolic  and  abstract  values.  Finally,  we 
note  that  this  issue  may  be  artificial,  since  it  may  be  possible  to  construct  a  more  flexible 
version  of  the  framework  in  which  symbolic  and  abstract  values  intermingle  within  terms. 

5.4.2  Abstracting  from  Paths 

One  limitation  of  the  above  analysis  is  that  it  does  not  allow  abstraction  from  the  path 
traveled  by  the  moving  agent.  For  example,  although  data  values  are  abstracted  (since  they 
are  represented  by  variables),  names  are  not  abstracted,  so  a  graph  like  the  one  in  Figure 
5.3  represents  only  moving  agents  that  follow  one  specific  path  through  the  network.  One 
way  to  eliminate  this  restriction  is  to  introduce  a  distinction  between  abstract  names  and 
concrete names, and allow the correspondence to be chosen to “match” the path actually
traveled by the moving agent. This is in the same spirit as our treatment of key constants:
we  require  that  the  interpretation  of  key  constants  as  keys  satisfy  some  sanity  conditions, 
but  we  do  not  fix  the  interpretation  of  key  constants  as  specific  keys. 

However,  abstracting  from  names  poses  obvious  difficulties.  In  particular,  when  messages 
are  sent  to  abstracted  names,  it’s  generally  not  clear  which  sets  of  messages  form  the  input 
histories  of  which  components.  If  the  system  is  very  homogeneous  (i.e.,  the  components 
whose  names  are  abstracted  all  behave  similarly,  as  do  the  servers  in  the  previous  section) 
and  the  processing  of  different  messages  by  each  component  is  sufficiently  independent,  these 
uncertainties  are  manageable,  but  in  general,  the  difficulties  are  formidable. 

5.4.3  Approximation  of  Message  Extensions 

Since  elements  of  Arbs  contain  a  set  (not  a  sequence)  of  key  constants,  the  analysis  does 
not  keep  track  of  the  order  in  which  keys  in  kcs  might  appear  in  message  extensions.  Some 
abstraction  from  the  order  of  signatures  in  the  extension  is  essential  for  the  analysis  to 
terminate,  since  extensions  can  be  of  unbounded  length  but  contain  only  a  finite  number  of 
different  keys. 


Chapter  6 


Related  and  Future  Work 


This  chapter  puts  the  work  described  in  this  thesis  in  context.  Section  6.1  looks  to  the  past, 
discussing  related  work.  Section  6.2  looks  to  the  future,  discussing  several  directions  for 
future  work. 

6.1  Related  Work 

Abstract  Interpretation 

Abstract  interpretation  is  an  extremely  general  framework  for  program  analysis  [AH87].  Our 
analysis  is  distinguished  from  pure  abstract  interpretation  by  the  use  of  symbolic  values 
to  track  relationships  between  values;  thus,  our  fixed-point  analysis  incorporates  symbolic 
computation  as  well  as  abstract  interpretation.  To  some  extent,  symbolic  values  can  be 
simulated  in  the  context  of  abstract  interpretation,  by  introducing  statically  an  abstract 
value  corresponding  to  each  symbolic  value  that  will  be  needed  in  the  course  of  the  analysis; 
this  approach  and  its  limitations  are  discussed  further  below. 

If  one  ignores  for  the  moment  our  use  of  symbolic  values,  our  analysis  is  a  form  of  abstract 
interpretation.  However,  there  is  some  qualitative  difference  between  our  analysis  and  most 
traditional  uses  of  abstract  interpretation,  which  typically  deal  with  domains  whose  structure 
mirrors the structure of the program being analyzed. This is the case, for example, for
analyses that infer types for expressions, compute def-use chains between occurrences of
variables, or determine whether certain expressions are constant. Such analyses are typically
designed to be incorporated into compilers, so low computational cost is essential.

Fault-tolerance  analysis  focuses  on  properties  of  a  system’s  behavior  whose  verification 
may require (in general) high computational complexity. So, the domains tend to be based
less on the structure of the system and more on the structure of its executions. While
there  is  no  sharp  distinction  here,  an  example  may  help  illustrate  the  difference.  Consider  a 
restricted  form  of  fault-tolerance  analysis  that  yields,  for  each  pair  of  components,  a  single 
ms-atom  describing  the  communication  between  those  components.  This  restricted  analysis 
has  a  static  flavor,  since  the  structure  of  the  domains  corresponds  closely  to  the  structure  of 
the  system,  but  it  is  sometimes  inadequate.  For  example,  in  the  analysis  of  FIFO  reliable 
broadcast  in  Section  4.1,  it  was  important  to  distinguish  between  the  values  and  multiplicities 
of  the  two  messages  whose  ordering  is  being  checked.  The  restricted  analysis  would  be  too 
imprecise. 

The abstraction mechanisms in our framework have been designed to provide great flexibility
in the precision with which systems are modeled; for example, when writing an input-output
function, one can choose to use at most one ms-atom in each poset, or one can choose
to use many. In general, the framework supports rather than enforces approximations. With
this flexibility comes a burden on the user to select appropriate approximations. This burden
seems inevitable, since verification of asynchronous systems with channels of unbounded
capacity is undecidable [BZ83].

Failure  Propagation  and  Transformation  Notation 

In  their  work  on  applying  HAZOP  and  FMECA  to  computer-based  systems,  McDermid  et 
al.  [FM93,FMNP94,MNPF95]  have  developed  an  approach  to  validation  of  fault-tolerance 
that  shares  with  our  work  the  idea  of  characterizing  each  component  by  how  it  generates 
and propagates “failures” (perturbations). Their approach is embodied in their failure
propagation and transformation notation (FPTN). FPTN achieves simplicity at the expense of
generality.  One  can  choose  for  each  system  the  relevant  kinds  of  perturbations,  but  each 
kind  of  perturbation  must  be  represented  by  a  single  boolean  value.  For  example,  one  bit 
might  indicate  omission  of  an  output;  another  bit  might  indicate  arbitrary  corruption  of  an 
output  value. 

Our  framework  is  parameterized  by  the  domains  A  Val  and  A  A  Val,  so  the  representation 
can  be  customized  for  different  application  domains.  For  example,  in  the  analysis  of  the 
cryptography-based  protocol  in  Chapter  5,  we  introduce  Arb  and  use  it  to  represent  (roughly 
speaking) the set of cryptographic information known to each faulty component.

Fault-Tolerance as Self-Similarity

Fault-tolerance as self-similarity [Web93] shares with our work the goals of separating the
specification of fault-tolerance from other correctness requirements and developing specialized
techniques for verification of fault-tolerance requirements. To achieve this, Weber adopts
a rigid notion of fault-tolerance: he equates fault-tolerance with fault-masking. In other
words, he defines a system to be fault-tolerant iff its visible behavior in the presence of faults
is the same as in the absence of faults. This is attractive because fault-masking properties
can be expressed in terms of bisimilarity:¹ a system masks a fault iff the fault causes
transitions only between bisimilar states. Thus, this approach allows one to leverage work
on checking bisimilarity [CS96]. This technique is interesting but limited in applicability to
systems in which faults are completely masked.

Abstraction  in  Model  Checking 

Abstractions play an important role in our work. Clarke, Grumberg, and Long studied the
use of abstractions in conjunction with temporal-logic model-checking [CGL92,CGL94].
Their notion of abstraction corresponds roughly to abstract interpretation and to our notion
of abstract values, though in their state-based approach, multiplicities are not explicit, so
abstractions are used only for (data) values. They also propose so-called symbolic abstractions,
which are just abbreviations for finite families of (non-symbolic) abstractions. Our
symbolic values are closer to the technique they sketch in the last paragraph of [CGL94] for
dealing with infinite-state systems.

In Kurshan’s automata-based verification methodology, approximations are embodied
in reductions between verifications [Kur89,Kur94]. A typical use of a reduction is to collapse
multiple states of an automaton into a single state of a reduced automaton; this is
analogous to introducing abstract values. Relationships between concrete values can be
expressed in Kurshan’s methodology using parameterized families of reductions, reminiscent of
Clarke, Grumberg, and Long’s so-called symbolic abstractions. For example, to verify that
a bounded-length queue containing numbers in $[1..n]$ does not drop items, one can use a
family of reductions that collapses the set $[1..n]$ of concrete data values to two abstract data
values: the one being “focused on”, specified as a parameter, and “everything else” [Kur94,
Appendix D]. The relationship captured here is equality of each value with the concrete
value being focused on.² The parameter of the reduction corresponds in our framework to
use of a variable representing the focused-on value. For problems involving related values
(e.g., $X$ and $F(X)$), the reductions must introduce an abstract data value representing each
such value. In effect, one must determine in advance all relevant symbolic values (e.g., all
symbolic values that would arise during the fixed-point calculation in our framework) and
introduce an abstract data value for each. Note that the resulting abstractions are not modular,
since they may include abstract values corresponding to symbolic values that contain
variables local to different processes.

¹ Roughly, states $s$ and $s'$ are bisimilar if the set of visible behaviors possible starting from state $s$ equals
the set of visible behaviors possible starting from state $s'$. For details, see [Mil89].

In  our  framework,  it  is  not  necessary  to  require  that  the  items  in  the  queue  come  from 
a  finite  set.  Using  symbolic  multiplicities,  it  is  not  even  necessary  to  require  that  the  queue 
have  bounded  length.  In  this  case,  the  input-output  function  would  need  to  incorporate 
non-trivial  abstractions,  which  would  need  to  be  verified  manually. 

An attractive feature of Clarke and Long’s work and Kurshan’s work is that abstractions
(or  reductions)  are  specified  as  homomorphisms  and  applied  to  programs  (or  automata) 
automatically.  We  plan  to  look  at  mechanized  support  for  applying  abstractions  in  our 
framework. 

Verification  of  Byzantine  Agreement  Algorithms 

Our  approach  to  fault-tolerance  analysis  is  similar  in  spirit  to  the  state-exploration  technique 
of Gong, Lincoln, and Rushby [GLR95]. In both cases, an automated analysis is used
to  compute  and  evaluate  the  behavior  of  the  system  separately  in  each  failure  scenario  of 
interest.  However,  the  work  described  in  [GLR95]  apparently  does  not  include  the  use  of 
any  form  of  abstraction. 

6.2  Future  Work 

This  section  describes  some  possible  extensions  of  and  variations  on  our  work. 

Inter-channel orderings. Because our representation of runs contains no inter-channel
orderings, our analysis suffers (in effect) from the merge anomaly [Kel78,Bro88]. Specifically,
when modeling a non-strict component, one is forced to use a conservative approximation.
One way to remedy this is to adopt at the abstract level some analogue of the approach used
at the concrete level. However, this would lead to an inefficient analysis, since modeling
each component by a set of functions causes an explosion in the number of combinations of
functions that must be considered in the analysis of a system.

2 Although the method just described is one way to prove that the queue doesn’t drop items, it is
apparently not the method Kurshan has in mind, since in [Kur94, Appendix D], the reduction is not actually
parameterized: the concrete value being focused on is fixed to be 1. Presumably Kurshan has in mind the
following method. Let $Qred_i$ be the reduced automaton obtained from the reduction $h_i$ that focuses on the
concrete value $i$. It suffices to check that $Qred_1$ does not drop items, and that for each $i \in [2..n]$, $Qred_i$ is
isomorphic to $Qred_1$. Note that this method still requires iterating over O(n) reductions.

A  more  promising  remedy  is  to  add  inter-channel  orderings  to  the  representation  of  runs. 
This  is  one  of  the  ideas  behind  Brock  and  Ackermann’s  scenarios  [BA81,Bro83]  and  Pratt’s 
model of processes [Pra82]. In both of those models, a process is a (potentially infinite)
set,  each  element  of  which  represents  a  complete  (and  potentially  infinite)  behavior.  This 
contrasts  with  Kahn’s  model,  in  which  a  process  is  a  function  that  can  be  used  to  determine 
the  behavior  of  a  system  incrementally  via  a  fixed-point  calculation.  So,  neither  scenarios  nor 
Pratt’s  model  is  directly  suitable  as  the  basis  for  an  efficient  fixed-point  analysis.  However, 
a related approach seems feasible. Roughly, one adds to the representation (2.29) of runs a
partial order on the ms-atoms that appear in the run. For example, a run could be a pair
$(r, \prec)$, where $r \in Name \to Hist$ and
$\prec \in Order(\bigcup_{(x,y) \in Name \times Name} \{(x,y)\} \times r(y)(x))$. Input-
output functions must be extended to deal with these inter-channel orderings; in particular,
the  domain  of  input-output  functions  is  extended  with  a  partial  order  on  the  input  ms-atoms 
(from  all  sources),  and  the  range  is  extended  with  orderings  between  input  and  outputs  (i.e., 
causal  dependencies  of  outputs  on  inputs)  and  orderings  between  outputs.  The  details  of  this 
approach  remain  to  be  worked  out  (proving  soundness  of  the  analysis  apparently  becomes 
much  more  complicated). 

Integrating symbolic and abstract values. As discussed in Section 5.4.1, the separation
between symbolic and abstract values enforced in the current framework is sometimes
awkward. It would be interesting to explore ways of allowing a tighter integration of the two.
For example, we might label each subterm of a symbolic value with an abstract value. We
should also allow the wildcard to appear within an expression. This would allow values like
plus(X:N, _:?):N, which represents the set of numbers {ρ(X), ρ(X)+1}. We might want to
refer to this value in other ms-atoms, so we should also allow a variable to be associated with
this expression as a whole; thus, we might use a value of the form (plus(X:N, _:?) as Y):N,
where Y represents this value as a whole.

Dynamic  creation  of  components.  The  frameworks  described  in  this  thesis  cannot 
directly  represent  systems  in  which  components  are  created  dynamically.  Such  systems 
can be modeled by including a sufficient number of idle components and sending special
messages  to  activate  them  when  they  are  actually  created.  One  difficulty  with  this  approach 
is  that  it  may  be  difficult  to  determine  in  advance  what  is  a  “sufficient  number”.  Also, 
this  approach  is  awkward  if  the  input-output  functions  associated  with  new  components  are 
determined  dynamically  (i.e.,  when  components  are  created),  since  our  current  frameworks 
use  a  fixed  mapping  nf  from  names  to  input-output  functions.  This  dynamic  behavior  can 
be  mimicked  (albeit  awkwardly)  using  input-output  functions  that  are  (in  effect)  interpreters 
for  the  language  in  which  input-output  functions  are  written. 

Instead of developing techniques to simulate dynamic component creation in the current
frameworks, we can extend the frameworks to represent dynamic component creation
directly. One approach is to extend the range of input-output functions to include component
creation events. A component creation event must contain the input-output function
associated with the new component, so the type of input-output functions is of the form

    IOF = InMsgs → OutMsgs × Set(Name × IOF).

Making  sense  of  recursive  definitions  of  this  form  requires  domain  theory,  as  opposed  to  the 
set  theory  used  in  this  thesis.  An  alternative  approach  is  to  give  an  operational  semantics, 
in  the  form  of  a  transition  system  for  some  specific  agent  language  (e.g.,  lambda  calculus 
extended  with  some  primitives  for  agent  creation  and  communication);  this  is  the  approach 
taken  for  actors  in  [AMST93]. 
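
As a rough illustration of the shape of this recursive type, the following OCaml sketch wraps the function in a data constructor, which is one way a typed functional language admits such a definition. It says nothing about the semantic questions raised above. The message types, the use of a list in place of a set, and the components echo and spawner are hypothetical placeholders, not part of the frameworks in this thesis.

```ocaml
(* Hypothetical placeholder types; the thesis does not fix these. *)
type name = string
type in_msgs = string list
type out_msgs = string list

(* IOF = InMsgs -> OutMsgs x Set(Name x IOF); the set of creation
   events is approximated here by a list. *)
type iof = Iof of (in_msgs -> out_msgs * (name * iof) list)

(* Apply an input-output function to its input. *)
let apply (Iof f) input = f input

(* A component that echoes its input and creates no components. *)
let echo = Iof (fun input -> (input, []))

(* A component that produces no output and creates one echo
   component per input message. *)
let spawner =
  Iof (fun input ->
    let created =
      List.mapi (fun i _ -> ("child" ^ string_of_int i, echo)) input in
    ([], created))
```

Here apply spawner ["a"; "b"] yields an empty output together with two creation events, named child0 and child1, each carrying the echo function.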

Automated  support  for  abstractions.  As  mentioned  in  Section  6.1,  we  plan  to  look 
at  mechanized  support  for  applying  abstractions  in  our  framework.  This  would  reduce  the 
burden  of  proving  that  input-output  functions  represent  processes.  To  provide  automated 
support, specific languages must be chosen for expressing processes and input-output
functions. For the former, a language like PROMELA could be used [Hol91]; for the latter, a
functional subset of CAML. An abstraction would be embodied as a transformation T that
maps a process P to an input-output function T(P) that represents P (and incorporates the
abstraction).  In  general,  correctness  of  the  transformation  would  be  verified  manually;  that 
is,  one  would  prove  that,  for  all  processes  P  in  a  certain  class,  T(P)  represents  P.  The 
benefits  of  expressing  an  abstraction  as  a  transformation  are  that  the  transformation  can 
be  applied  automatically  and  the  effort  of  verifying  the  transformation  can  be  amortized  by 
applying  it  to  many  systems. 
State-based fault-tolerance analysis. The history-based model developed in this thesis
makes  multiplicities  explicit  and  makes  the  entire  history  of  a  computation  available  to  the 
input-output  function  at  every  step.  This  gives  input-output  functions  fine  control  over  the 
approximations  used  to  represent  the  contents  of  the  channels.  Indeed,  this  is  the  primary 
benefit  of  using  histories.  On  the  other  hand,  a  state-based  approach  would  be  simpler 
in  some  respects,  so  it  is  interesting  to  consider  the  possibilities  for  a  state-based  fault- 
tolerance  analysis  that  can  still  cope  with  asynchronous  distributed  systems  with  unbounded 
communication  channels.  Naturally,  the  contents  of  the  channels  would  be  included  as  part  of 
the  state  of  the  system.  In  order  to  represent  indefinite  and  unbounded  multiplicities  (e.g.,  in 
the  output  of  Byzantine-faulty  components,  or  in  the  output  of  a  component  that  repeatedly 
transmits  “I’m  alive”  messages),  the  contents  of  channels  could  be  approximated  using  an 
appropriate  generalization  of  regular  expressions.  To  use  this  state-based  approach,  one  must 
still  develop  representations  of  the  components  that  deal  with  this  high-level  representation  of 
the  contents  of  channels.  Assuming  the  state  of  the  system  includes  only  the  current  contents 
of  channels — not  the  entire  history — less  information  is  available  to  these  representations 
of  the  components,  so  controlling  the  use  of  approximations  (especially  approximations  of 
multiplicities)  may  be  more  difficult. 

Applications.  To  test  and  refine  the  approach — and  the  tool  described  in  Appendix  B — 
we  must  apply  them  to  more  problems.  Possible  applications  include  efficient  algorithms  for 
asynchronous  Byzantine  Agreement  [CR93],  algorithms  for  the  certified  write-all  problem 
[KMS95,BKRS96],  secure  protocols  for  group  membership  and  reliable  broadcast  [Rei96, 
MR96],  and  cryptographic  protocols  for  fault-tolerant  moving  agents  [MvRSS96]. 
Index of Symbols

Symbol                  Page    Description
≜                       11      equal by definition
Seq                     11      finite and infinite sequences
CHist                   11      history of concrete messages
                                CHist = Name → Seq(CVal)
→                       11      constructor for signature of functions
⟨⟨·⟩⟩                   10      sequence
ε                       11      empty sequence
⊑_CHist                 11      partial order on CHist
DProcess                12      determinate process
                                DProcess = CHist → CHist (monotonic and continuous)
[illegible]             12      monotonic and continuous functions
CRun                    12      concrete run
                                CRun = Name → CHist
step                    12      step function
⊑_CRun                  12      partial order on CRun
crun                    13      concrete run of a determinate system
⟨·⟩                     13      tuple
Chain                   13      chains of a partial order
|·|                     13, 33  length of a sequence, or size of a set
dom                     13, 34  domain of a sequence or function
⊥_CRun                  14      least element of CRun
⊥_CHist                 14      least element of CHist
Process                 15      (non-determinate) process
                                Process = Set(IRProcess)
IRProcess               15      input-restricted process
                                IRProcess = DProcess × Set(CHist)
cruns                   15      concrete runs of a system
π_i                     15      project i'th component of a tuple
∘                       15      function composition
enabled                 15      enabledness of input-restricted process
∧                       16      conjunction (infix or bullet-style)
∨                       16      disjunction (infix or bullet-style)
Order                   19      strict partial orderings
POSet                   19      strictly-partially-ordered sets
Hist                    19      history of messages
                                Hist = Name → POSet(L)
Run                     19      run
                                Run = Name → Hist
L                       20      labels
                                L = Mul × Val × Tag
Val                     20      values
                                Val = P_fin(SVal × AVal) \ {∅}
P_fin                   20      finite subsets
\                       20      set difference
AVal                    21      abstract values
Interp_Set              21      interpretation (of abstract values)
                                Interp_Set(S) = S → Set(CVal)
Con                     21      constant
Var                     21      variable
SVal_0                  22      symbolic values except wildcard
Sym                     22      symbols (constants and variables)
                                Sym = Con ∪ Var
SVal                    22      symbolic values
                                SVal = SVal_0 ∪ {_}
Mul                     24      multiplicities
                                Mul = P_fin(SVal × AMul) \ {∅}
AMul                    24      abstract multiplicities
IOF                     27      input-output functions
                                IOF = {f ∈ Hist → Hist | tagUniform(f)}
inj                     28      injections
=_{POSet(L)}            28      equality on POSet(L)
=_Hist                  28      equality on Hist
tagUniform              28      output is uniform WRT tags in input
⊥_Run                   28      least element of Run
⊥_Hist                  28      least element of Hist
=_Run                   28      equality on Run
interp                  34      partial interpretation (of symbols)
⇀                       34      partial functions
⊑_interp                34      ordering on interp
onto                    34      surjective (onto) functions
compat_Val              35      compatibility of an abstract value with a concrete value
compat_{POSet(L)}       35      compatibility of labels with a concrete history
ginv                    35      generalized inverse of function g
⟦·⟧_{POSet(L)}          36      meaning of a partial-order of labels
⟦·⟧_Hist                36      meaning of a history
⟦·⟧_IOF                 37      meaning of an input-output function
⟦·⟧_Sys                 37      meaning of a system
Interp                  37      extensions of a partial interpretation (of symbols)
⟦·⟧_Run                 37      meaning of a run
cruns_fin               38      finite-length concrete runs
[illegible]             39      restriction of an invariant to a set of names
⊑_{POSet(L)}            43      ordering on POSet(L)
⊑_InHist                43      ordering on histories regarded as inputs
⊑_OutHist               43      ordering on histories regarded as outputs
⊑_Run                   43      ordering on runs
Process_F               50      failure-prone process
                                Process_F = {p ∈ Fail ⇀ Process | OK ∈ dom(p)}
IOF_F                   50      input-output function for a failure-prone component
                                IOF_F = {f ∈ Fail ⇀ IOF | OK ∈ dom(f)}
FS                      50      failure scenarios
fs_OK                   50      failure-free failure scenario
cruns_F                 50      concrete runs of a system
step_F                  50      step function for a system in a given failure scenario
⟦·⟧_{IOF_F}             50      meaning of an element of IOF_F
B                       51      booleans
origIndep_C             63      sanity condition for Process_FC
consistent              64      sanity condition for Process_FC
Process_FC              64      failure-prone process, represented using changes
cruns_FC                64      concrete runs of a system, represented using changes
L_FC                    66      original label with changes, or new label
                                L_FC = L_per ∪ L_new
L_per                   66      original label with changes
                                L_per = Mul × Val × ΔMul × ΔVal × Tag
L_new                   66      new label
                                L_new = Mul × Val × Tag
ΔAVal                   66      change to abstract value
ΔAMul                   67      change to abstract multiplicity
ΔVal                    67      change to value
                                ΔVal = P_fin(SVal × ΔAVal) \ {∅}
ΔMul                    67      change to multiplicity
                                ΔMul = P_fin(SVal × ΔAMul) \ {∅}
Hist_FC                 67      history of messages, represented using changes
                                Hist_FC = Name → POSet(L_FC)
Run_FC                  67      run of a system, represented using changes
                                Run_FC = Name → Hist_FC
IOF_FC                  69      input-output function for a failure-prone component,
                                represented using changes
tagUniform_FC           69      output is uniform WRT tags in input
orig                    69      projection on original behavior
origIndep               69      sanity condition for IOF_FC
b_orig                  69      bijection associated with orig
run_FC                  70      run of a failure-prone system, represented using changes
unchanged               70      unchanged (for a poset of labels)
totalOrd                70      totally-ordered (a property of posets)
unchanged_Val           70      unchanged (for a value)
⟦·⟧_{POSet(L_FC)}       75      meaning of an element of POSet(L_FC)
compat_{POSet(L_FC)}    75      compatibility of labels with original and perturbed
                                concrete histories
compat_ΔVal             75      compatibility of a change to a value
                                with original and perturbed values
⟦·⟧_{Hist_FC}           75      meaning of elements of Hist_FC
⟦·⟧_{IOF_FC}            76      meaning of elements of IOF_FC
⟦·⟧_{Sys_FC}            76      meaning of a failure-prone system
                                represented using changes
⟦·⟧_{Run_FC}            76      meaning of elements of Run_FC
⊑_{POSet(L_FC)}         78      ordering on POSet(L_FC)
⊑_{InHist_FC}           78      ordering on Hist_FC regarded as inputs
⊑_{OutHist_FC}          78      ordering on Hist_FC regarded as outputs
⊑_{Run_FC}              78      ordering on Run_FC
CRAFT: A Tool for Fault-Tolerance
Analysis


The  non-perturbational  and  perturbational  analysis  frameworks  described  in  Chapter  3  are 
implemented  in  a  prototype  tool,  called  CRAFT  (Change-Relations  for  Analysis  of  Fault- 
Tolerance).  This  appendix  sketches  the  structure  and  use  of  CRAFT. 

B.1 Overview

CRAFT is implemented in the functional programming language CAML Light [Ler97], a
dialect of ML related to Standard ML [MTH90]. CRAFT provides a collection of CAML types and functions
that  implement  the  non-perturbational  and  perturbational  analysis  frameworks  described  in 
Chapter  3.  CRAFT  also  provides  a  graphical  interface  to  facilitate  entry  of  systems  and 
inspection  of  analysis  results.  This  section  gives  an  overview  of  the  use  of  CRAFT;  the 
remaining  sections  contain  more  detailed  descriptions  of  the  types  and  functions  provided 
by  CRAFT. 

To  use  CRAFT  to  analyze  a  system,  the  first  step  is  to  express  the  input-output  functions 
representing  system  components  as  CAML  functions  of  appropriate  type.  Note  that  input- 
output  functions  are  written  directly  in  the  CAML  programming  language.  This  allows 
the full power of CAML and its libraries to be used. However, automatically checking
requirements  (such  as  uniformity  with  respect  to  tags)  on  these  functions  is  harder  than  it 
would  be  for  a  more  restricted  language;  this  is  partly  why  CRAFT  leaves  enforcement  of 
such  requirements  to  the  user.  CRAFT  provides  CAML  types  corresponding  to  IOF  and 
IOF_FC. Input-output functions should have one of these two types, depending on whether
a non-perturbational or perturbational analysis is desired.

Once  the  input-output  functions  have  been  expressed  in  CAML,  the  remaining  steps 
depend  on  whether  the  graphical  interface  is  being  used.  Without  the  graphical  interface, 
the  next  step  is  to  define  a  mapping  sys  from  names  (represented  in  CAML  as  strings)  to 
input-output  functions  that  represents  the  system.  Mappings  are  constructed  using  functions 
in  the  map  module  in  the  CAML  standard  library.  Functions  in  the  map  module  are  also  used 
to  define  failure  scenarios,  which  are  represented  in  CAML  as  mappings  from  names  to 
failures.  Given  a  system  sys  and  a  failure  scenario  fs,  functions  provided  by  CRAFT  are 
used  to  compute  a  run  r  representing  the  behavior  of  system  sys  in  that  failure  scenario. 
To  determine  whether  such  a  run  satisfies  the  fault-tolerance  requirement,  we  express  the 
fault-tolerance  requirement  as  a  CAML  function  ftr,  which  takes  a  failure  scenario  and  a 
run  as  arguments  and  returns  a  Boolean,  and  compute  ftr  fs  r.  If  this  returns  false,  a 
textual  representation  of  the  run  can  be  inspected  to  help  ascertain  the  problem.  CRAFT 
does  not  currently  provide  a  special  function  to  automatically  repeat  the  analysis  for  all 
possible  failure  scenarios  for  a  system,  but  this  is  trivial  to  implement,  since  CRAFT  does 
provide  a  function  that  returns  a  list  of  all  possible  failure  scenarios  for  a  system. 
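
The loop over all failure scenarios mentioned above can be sketched as follows. This is a minimal illustration in modern OCaml, not CRAFT's actual code: association lists stand in for the map module, the failure type is a two-element placeholder, all_scenarios stands in for the function CRAFT provides, and the analyze argument stands in for CRAFT's fixed-point computation, whose real name is not given here.

```ocaml
(* Hypothetical stand-ins for CRAFT's types; the tool's names may differ. *)
type failure = OK | Crash
type scenario = (string * failure) list  (* component name -> failure *)

(* Enumerate every failure scenario, given each component's name and
   its list of possible failures. *)
let rec all_scenarios = function
  | [] -> [ [] ]
  | (name, failures) :: rest ->
      let tails = all_scenarios rest in
      List.concat_map
        (fun f -> List.map (fun tail -> (name, f) :: tail) tails)
        failures

(* Return the scenarios whose computed run violates the requirement.
   analyze fs computes a run for scenario fs; ftr fs r checks it. *)
let failing_scenarios ~analyze ~ftr scenarios =
  List.filter (fun fs -> not (ftr fs (analyze fs))) scenarios
```

With two components that can each be OK or Crash, all_scenarios returns four scenarios. Passing the real analysis and requirement as analyze and ftr would then list exactly the scenarios that need inspection.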

CRAFT’s graphical interface is implemented using CamlTk, a CAML interface to the Tk
widget library. The graphical interface can be used by those unfamiliar with CAML to access
libraries of input-output functions already written in CAML.1 A user clicks (button1)2 on the canvas
to add a component to the system. This creates a new node at the location of the click
and pops up the “Create Node” window, which is used to specify the input-output function
associated with the new node. Specifically, the “Create Node” window displays the names of
all the input-output functions in the library and contains fields for entering the parameters of
each input-output function.3 The user selects one of those input-output functions and enters
values for its parameters (if any). For example, if the function Voter_FC defined in (3.53) is in
the library, there would be fields to enter its three parameters (namely, srcs, dest, and aval).
The user uses the same window to select the possible failures of the new component. Figure
B.1 shows the “Create Node” window for a library containing input-output functions similar
to those used in the running example in Chapter 3. If “Arbitrary Failure” is selected as a
possible failure, the user enters in the “Dests” field below it the names of the neighbors of
the new component; if the new component suffers an arbitrary failure, it will send arbitrary
messages to its neighbors.

1 For historical reasons, the representations of runs and input-output functions used in the implementation
of the graphical interface are slightly different from the representations described in Section B.2.

2 By convention, mouse buttons are numbered from left to right.

3 The writer of a library must include in the library a special value describing the names and types of
the parameters of each input-output function. CRAFT uses this special value to generate a window with
appropriate fields for entering parameters.

After  entering  a  system  by  creating  a  node  corresponding  to  each  system  component, 
the  user  can  specify  a  failure  scenario  and  execute  the  fixed-point  analysis.  When  each 
component  is  created,  failure  OK  is  associated  with  that  component;  thus,  the  default 
failure scenario is fs_OK. To change the failure associated with a component, the user clicks
(Control-button1) on the corresponding node. This pops up the “Set failure status” window,
which  is  used  to  select  one  of  the  possible  failures  of  that  component.  Nodes  for  which  a 
failure  other  than  OK  is  selected  are  displayed  with  a  red  border  (recall  that  in  the  figures 
in  this  thesis,  such  nodes  are  displayed  with  dots  on  their  circumference).  By  repeating  this 
procedure,  the  user  can  select  any  failure  scenario  of  interest. 

When  a  failure  scenario  of  interest  has  been  selected,  the  user  selects  the  “Analyze” 
command  from  the  pull-down  menu  entitled  “Analyze”.  The  fixed-point  is  computed,  and 
(if  the  computation  terminates)  the  result  is  displayed.  If  the  computed  ms-atoms  are  not 
too  long  or  too  numerous,  they  are  displayed  directly  on  the  edges  of  the  graph,  as  in 
Figures B.2 and B.3. The ms-atoms are color-coded (unfortunately, this is not apparent
from  the  black-and-white  printouts  in  the  figures):  black  and  brown  text  are  used  for  the 
value  and  multiplicity,  respectively,  in  the  original  part  of  a  ms-atom;  red  text  is  used 
for  the  perturbation;  and  blue  text  is  used  for  new  ms-atoms.  The  edges  are  also  color- 
coded:  black  is  used  for  edges  labeled  only  with  perturbed  ms-atoms  containing  the  identity 
perturbation;  red,  for  edges  labeled  only  with  perturbed  ms-atoms  such  that  some  ms-atom 
contains  a  perturbation  other  than  the  identity;  blue,  for  edges  labeled  only  with  new  ms- 
atoms  or  perturbed  ms-atoms  containing  only  the  identity  perturbation;  and  violet,  for  edges 
labeled with both new ms-atoms and perturbed ms-atoms such that some ms-atom contains a
perturbation  other  than  the  identity.  If  the  textual  representation  of  the  ms-atoms  does  not 
fit  on  an  edge,  the  edge  is  still  color-coded,  but  the  ms-atom  is  elided.  The  user  can  click 
(Shift-button2)  on  an  edge  to  pop  up  a  window  showing  all  of  the  ms-atoms  on  that  edge. 
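
The edge-coloring rule just described can be summarized by a small classification function. The sketch below is illustrative only: the types atom_kind and color and the function edge_color are hypothetical names, not taken from CRAFT. It also assumes that an edge is blue only when it carries at least one new ms-atom, since otherwise the black and blue cases would overlap, and that an edge with no ms-atoms is black.

```ocaml
(* Hypothetical classification of the ms-atoms on an edge. *)
type atom_kind =
  | Identity   (* perturbed ms-atom with the identity perturbation *)
  | Perturbed  (* perturbed ms-atom with a non-identity perturbation *)
  | New        (* new ms-atom *)

type color = Black | Red | Blue | Violet

(* Choose the edge color from the kinds of ms-atoms labeling the edge. *)
let edge_color atoms =
  let has_new = List.mem New atoms in
  let has_perturbed = List.mem Perturbed atoms in
  match has_new, has_perturbed with
  | false, false -> Black
  | false, true -> Red
  | true, false -> Blue
  | true, true -> Violet
```

For example, an edge labeled with one identity-perturbed ms-atom and one new ms-atom is classified as blue, while adding a non-identity perturbation to that edge makes it violet.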

Figure B.2 contains a screen-dump of the result of the analysis in the absence of failures
for the system used as the running example in Chapter 3. Figure B.3 shows the result of the
analysis when component FI suffers a value failure. Note that “TopA” corresponds to TAy.

CRAFT supports miscellaneous other commands: saving and loading systems, showing
intermediate results of the fixed-point calculation (i.e., “single-stepping” through the
calculation),  deleting  and  moving  nodes,  selecting  different  fonts  and  colors,  etc. 


Figure B.1: Window for entering information about a new component.

Figure B.2: Result of analysis in absence of failures.

Figure B.3: Result of analysis when component FI suffers a value failure.

B.2 Type Definitions and Function Declarations

The  translation  of  the  mathematical  definitions  into  CAML  is  generally  straightforward. 
Figure  B.4  contains  CAML  type  definitions  corresponding  to  Val  and  Mul.  Associated  with 
each  type  is  a  function  used  to  print  values  of  that  type.  Declarations  of  those  functions  are 
omitted  here. 

The  CAML  type  sval  corresponding  to  SVal  does  not  enforce  two  restrictions  on  the 
form  of  symbolic  values:  (1)  the  first  argument  to  data  constructor  Expr  should  be  a  Con 
or  Var,  not  an  Expr  or  Wild;  (2)  the  second  argument  to  Expr  should  not  contain  Wild. 
One way to enforce these restrictions is to follow the approach taken in Section 2.2.2, i.e., to
define sval in terms of auxiliary types sym and sval_0, corresponding to Sym and SVal_0,
respectively. We would like these two auxiliary types to be subtypes of sval, but CAML
does not support subtyping, so we would need to introduce data constructors to inject sym
into  sval_0  and  sval_0  into  sval.  These  extra  constructors  would  be  inconvenient.  A  better 
approach  would  be  to  introduce  an  abstract  data  type  with  two  operations:  a  constructor 
function  that  checks  these  two  requirements,  and  a  destructor  function  that  simply  returns 
the  symbolic  value.  With  this  approach,  one  needs  only  one  extra  constructor  for  each 
symbolic  value,  rather  than  an  extra  constructor  for  each  symbol  in  the  symbolic  value. 
For  convenience  in  development,  the  current  implementation  does  not  actually  use  such  an 
abstract data type, but it would be trivial to do so using CAML’s module system.
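
The abstract-data-type approach could look roughly like the following sketch, written in modern OCaml rather than CAML Light. It is not part of the current implementation. The exact argument types of Expr are an assumption (a head symbol paired with a list of arguments), and the module name Checked_sval and its operations make and get are hypothetical.

```ocaml
type identifier = string

(* Assumed shape of sval; the argument types of Expr are a guess. *)
type sval =
  | Con of identifier
  | Var of identifier
  | Expr of sval * sval list
  | Wild

module Checked_sval : sig
  type t                    (* abstract: only make can build a t *)
  val make : sval -> t      (* raises Invalid_argument if ill-formed *)
  val get : t -> sval       (* destructor: returns the symbolic value *)
end = struct
  type t = sval

  (* True if the value contains no wildcard anywhere. *)
  let rec wild_free = function
    | Wild -> false
    | Con _ | Var _ -> true
    | Expr (head, args) -> wild_free head && List.for_all wild_free args

  (* Restriction (1): the head of Expr is a Con or Var.
     Restriction (2): the arguments of Expr contain no Wild. *)
  let rec well_formed = function
    | Con _ | Var _ | Wild -> true
    | Expr ((Con _ | Var _), args) ->
        List.for_all wild_free args && List.for_all well_formed args
    | Expr (_, _) -> false

  let make v =
    if well_formed v then v else invalid_arg "Checked_sval.make"

  let get v = v
end
```

Because the type t is abstract, every value of type Checked_sval.t has passed both checks, and a client pays for one call to make per symbolic value rather than one injection per symbol.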

The type aval corresponding to AVal is generally straightforward. Note that the
identifiers in the first argument of Arb are interpreted as constants (in KC). The treatment
of equality for aval requires some care. One way to handle equality is to introduce, for
each type, a function that tests equality of two elements of that type. However, it is more
convenient if we can instead use CAML’s built-in polymorphic structural equality function.
Structural equality is simply a “pointwise extension” of equality on base types; for example,
two lists of integers are structurally equal iff they have the same length and the elements
in corresponding positions are equal. Is structural equality the desired equality on aval?
In general, no. Recall from Chapter 5 that the arguments of Arb are sets, not sequences.
Since sets are not a base type in CAML, we use lists as arguments of Arb.4 Thus, in the
desired equality on aval, the order of elements in the arguments of Arb should be irrelevant.
One way to achieve this is to write an equality function that deliberately ignores the order.
As mentioned above, it is more convenient to use CAML’s built-in equality function. This
works provided the lists are maintained in a canonical form, i.e., sorted (according to some
total order) and without duplicates. This is the approach taken in the current implementation.
To mechanically enforce this invariant, we could introduce an abstract data type whose
constructor function puts the given term into canonical form.

4 CAML does provide a module that implements sets over ordered types using balanced trees. However,
the built-in equality function does not have the desired meaning on elements of that type, because the same
set can be represented by different balanced trees, depending on the history of operations used to construct
that set. So, using the set module wouldn’t help.
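
The canonical-form idea can be illustrated with a few lines of modern OCaml. This is a sketch under assumptions, not the CRAFT source: the cut-down aval type below keeps only TopV and a two-argument Arb whose arguments are both lists of identifiers, and the helper names canon and arb are hypothetical.

```ocaml
(* Put a list that represents a set into canonical form:
   sorted by the built-in total order, with duplicates removed. *)
let canon xs = List.sort_uniq compare xs

(* A cut-down, hypothetical abstract-value type. *)
type aval =
  | TopV
  | Arb of string list * string list

(* Constructor function that maintains the canonical-form invariant,
   so that built-in structural equality coincides with set equality. *)
let arb consts others = Arb (canon consts, canon others)
```

With this constructor, arb ["b"; "a"; "b"] ["y"; "x"] and arb ["a"; "b"] ["x"; "y"] are structurally equal, whereas values built directly with Arb from differently ordered lists are not.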

The  type  amul  is  the  same  as  aval.  Since  CAML  does  not  support  subtyping,  the  only 
alternative  would  be  to  make  amul  an  entirely  distinct  type,  which  would  have  prevented 
some  code  re-use.  Of  course,  equating  these  two  types  introduces  the  possibility  of  input- 
output  functions  encountering  in  their  input  abstract  values  like  TopV  used  as  multiplicities. 
In  this  case,  the  input-output  function  should  simply  abort  (e.g.,  by  raising  an  exception). 

The  types  val  and  mul  are  represented  using  lists,  which  should  be  regarded  as  sets.  In 
other  words,  the  list  should  be  maintained  in  canonical  form,  as  described  above.  Also,  the 
empty  list  is  prohibited.  Again,  it  would  be  easy  to  mechanically  enforce  these  restrictions 
using  abstract  data  types. 

The type daval corresponding to ΔAVal is straightforward; note that Full (a) corresponds
to ftA-. Type damul is the same as daval, for the same reasons that amul is the
same as aval. Types dval and dmul, corresponding to ΔVal and ΔMul, respectively, are
represented using lists, which should be kept in canonical form.

In  order  to  re-use  code  for  the  perturbational  and  non-perturbational  frameworks,  we 
make  the  type  of  ms-atoms  polymorphic  in  the  type  of  “events”,  and  use  different  types 
of  events  for  the  two  frameworks.  Specifically,  making  the  tag  part  of  the  ms-atom  allows 
re-use  of  the  code  that  tests  whether  two  graphs  are  equal  up  to  renaming  of  tags.  The  types 
event msatom and event_FC msatom correspond to L and L_FC, respectively.

The type 'e poset of posets with events of type 'e is represented as a pair of a list
of  ms-atoms  and  an  ordering.  The  list  of  ms-atoms  is  kept  in  canonical  form  (i.e.,  sorted 
and  duplicate-free).  The  ordering  is  represented  using  the  order  module,  which  is  part  of 
CRAFT.  Internally,  the  order  module  represents  an  ordering  as  a  transitively-closed  set  of 
pairs. 
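
The interface of the order module is not shown in this appendix, so the following is only a sketch, in modern OCaml, of what "a transitively-closed set of pairs" can look like. The function names transitive_closure and precedes are hypothetical, and the pair set is kept as a canonical (sorted, duplicate-free) list purely for illustration.

```ocaml
(* Canonical form for a set of pairs: sorted, without duplicates. *)
let canon ps = List.sort_uniq compare ps

(* Repeatedly add (a, d) whenever (a, b) and (b, d) are present,
   until no new pair appears. *)
let rec transitive_closure ps =
  let ps = canon ps in
  let composed =
    List.concat
      (List.map
         (fun (a, b) ->
            List.map (fun (_, d) -> (a, d))
              (List.filter (fun (c, _) -> c = b) ps))
         ps)
  in
  let ps' = canon (ps @ composed) in
  if ps' = ps then ps else transitive_closure ps'

(* With a transitively-closed representation, testing whether x precedes y
   is a plain membership test. *)
let precedes ord x y = List.mem (x, y) ord

let ord = transitive_closure [(1, 2); (2, 3)]
```

Storing the closure makes order queries trivial at the cost of a larger representation.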

The type 'e hist of histories with events of type 'e is represented using the map module,
which is part of the CAML standard library and provides an implementation of finite maps
with ordered domains using balanced trees. We allow partial maps, with the convention that
elements not in the domain of the map are implicitly mapped to the empty poset. The use of
finite maps rather than the function type name -> 'e poset allows efficient and convenient

(* component name *)
type name == string;;

(* identifiers are used to name constants and variables. *)
type identifier == string;;

(* SVal *)
type sval =
  | Con of identifier
  | Var of identifier
  | Expr of sval * (sval list)
  | Wild
;;

(* AVal *)
type aval =
  | One              (* denotes {1} *)
  | ZeroOne          (* denotes {0,1}. printed as "?" *)
  | Nat              (* denotes the natural numbers. *)
  | Plus             (* denotes {1,2,...} *)
  | TopV             (* denotes all concrete values *)
  | MsgFrom of name  (* used in analysis of reliable bcast *)
  | Data             (* used in analysis of Byz. agreement *)
  | Msg              (* used in analysis of Byz. agreement *)
  | Arb of (identifier list * sval list)
                     (* used in analysis of Byz. agreement. *)
;;

(* AMul *)
type amul == aval;;

(* Val *)
type val == (sval * aval) list;;

(* Mul *)
type mul == (sval * amul) list;;

Figure B.4: CAML type definitions corresponding to Val and Mul.

(* Delta-AVal *)
type daval =
  | Identity       (* denotes the identity relation on CVal *)
  | Full of aval   (* denotes the full relation on aval *)
;;

(* Delta-AMul *)
type damul == daval;;

(* Delta-Val *)
type dval == (sval * daval) list;;

(* Delta-Mul *)
type dmul == (sval * damul) list;;

(* abstract event in non-perturbational framework *)
type event == mul * val;;

(* abstract event in perturbational framework *)
type event_FC =
  | Pert of mul * val * dmul * dval
  | New of mul * val
;;

(* tags used in ms-atoms *)
type tag == int;;

(* ms-atoms with events of type 'e.
   Type (event msatom) corresponds to L;
   type (event_FC msatom), to L_{FC}. *)
type 'e msatom == 'e * tag;;

Figure B.5: CAML type definitions corresponding to ΔVal, ΔMul, L, and L_FC.

iteration over the non-empty posets in a history.

The type 'e run of runs with events of type 'e is also represented using finite maps. By
convention, a component name not in the domain of the map is implicitly mapped to the
empty history.
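
The convention of partial maps with an implicit default can be illustrated with a small sketch in modern OCaml. It uses the standard Map.Make functor instead of the Caml Light map module, and it simplifies a poset to a bare list of events; the names NameMap, lookup and stored_names are assumptions made for this example only.

```ocaml
module NameMap = Map.Make (String)

(* Simplified stand-in for 'e poset: just a list of events, ordering omitted. *)
type 'e poset = 'e list
let empty_poset : 'e poset = []

(* A history is a finite (partial) map from names to posets. *)
type 'e hist = 'e poset NameMap.t

(* Lookup honoring the convention: a name outside the domain of the map
   is implicitly mapped to the empty poset. *)
let lookup (h : 'e hist) (n : string) : 'e poset =
  match NameMap.find_opt n h with
  | Some p -> p
  | None -> empty_poset

(* Unlike a function of type name -> 'e poset, a finite map can be
   traversed: this visits only the names actually stored. *)
let stored_names (h : 'e hist) : string list =
  List.rev (NameMap.fold (fun n _ acc -> n :: acc) h [])

let h = NameMap.(empty |> add "p" ["send"] |> add "q" ["recv"])
```

The same pattern applies one level up, where a run maps component names to histories and an absent name stands for the empty history.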

The  type  event  iofn  corresponding  to  IOF  is  straightforward.  The  requirement  that 
these  functions  be  uniform  with  respect  to  tags  is  not  mechanically  enforced. 

The type (event_FC, 'f) iofn_F corresponds to IOF_FC, with type 'f corresponding to
the set Fail of possible failures. Since elements of IOF_FC are partial functions, we represent
them using a pair, whose first component specifies the domain of the partial function, and
whose second component corresponds to the function itself.
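
As a rough illustration of representing a partial function by a (domain, function) pair, consider the following sketch in modern OCaml. Histories are abstracted to a type parameter 'h, the failure type lists only the two constructors visible in Figure B.6 (the source elides the rest), and the helper apply and the component relay are hypothetical.

```ocaml
(* Only the constructors visible in the source; the remainder is elided there. *)
type failure = OK | ValFail

(* Partial function from failures to input-output functions, as a pair:
   the list gives the domain, the function gives the behavior on it.
   'h stands in for the history type. *)
type ('h, 'f) iofn_F = 'f list * ('f -> 'h -> 'h)

(* Apply the partial function, returning None outside its domain. *)
let apply ((dom, f) : ('h, 'f) iofn_F) (fl : 'f) (h : 'h) : 'h option =
  if List.mem fl dom then Some (f fl h) else None

(* Hypothetical component whose behavior is specified only for OK:
   it passes its input history through unchanged. *)
let relay : (string list, failure) iofn_F = ([OK], fun _ h -> h)

let defined = apply relay OK ["m"]
let undefined = apply relay ValFail ["m"]
```

Keeping the domain as explicit data is also what makes it possible to enumerate the failure scenarios of a system.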

A system with events of type 'e and failures of type 'f is represented by an element of
type ('e, 'f) system_F, i.e., by a finite mapping from names to input-output functions.
Type 'f failure_scenario is straightforward.

Recall that a fault-tolerance requirement is a function b such that for each failure scenario
fs, b(fs) is a predicate on runs. This signature corresponds directly to the type ('e, 'f)
ft_req of fault-tolerance requirements for systems with events of type 'e and failures of type 'f.
The sanity condition that these functions be independent of tags cannot be expressed in the
CAML type system; the user is responsible for ensuring that this condition is satisfied.

Finally, we are finished with the type definitions and come to the function declarations.
The function application step_F sys fs corresponds to step_F(nf, fs). Function
lfp computes least fixed points of functions of type ('e run) -> ('e run). Function
failure_scenarios returns a list containing all failure scenarios for a given system. The
implementations of these functions are straightforward. The only non-trivial aspect is that
lfp checks for termination of the fixed-point calculation using =_Run (defined on page 28), so
an implementation of this equality is needed. The current implementation of r1 =_Run r2 is as
follows: (1) the tags in each run are “normalized” to be a prefix of the natural numbers; (2)
if the runs contain different numbers of distinct tags, then r1 ≠_Run r2; (3) if the runs contain
the same number n of distinct tags, then for each permutation a of the natural numbers
0, 1, ..., n − 1, rename the tags in r1 according to a and check whether the resulting run equals
r2. If any permutation results in equality in step 3, then r1 =_Run r2; otherwise, r1 ≠_Run r2.
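
The three steps above can be sketched as follows in modern OCaml. This is not the CRAFT implementation: a run is simplified to a flat list of (event, tag) pairs with no ordering information, and the names normalize, permutations and equal_up_to_tags are chosen for this example.

```ocaml
(* Canonical form: sorted, without duplicates, so that = is set equality. *)
let canon r = List.sort_uniq compare r

let distinct_tags r = List.sort_uniq compare (List.map snd r)

(* Step 1: rename tags to 0, 1, ..., n-1, in order of increasing original tag. *)
let normalize r =
  let tags = distinct_tags r in
  let index t =
    let rec go i = function
      | [] -> invalid_arg "normalize: unknown tag"
      | x :: rest -> if x = t then i else go (i + 1) rest
    in
    go 0 tags
  in
  canon (List.map (fun (e, t) -> (e, index t)) r)

(* All permutations of a list of distinct elements. *)
let rec permutations = function
  | [] -> [[]]
  | xs ->
    List.concat
      (List.map
         (fun x ->
            let rest = List.filter (fun y -> y <> x) xs in
            List.map (fun p -> x :: p) (permutations rest))
         xs)

let equal_up_to_tags r1 r2 =
  let r1 = normalize r1 and r2 = normalize r2 in
  let n = List.length (distinct_tags r1) in
  (* Step 2: different numbers of distinct tags, so not equal. *)
  if n <> List.length (distinct_tags r2) then false
  else
    (* Step 3: try every renaming of the tags 0, ..., n-1 in r1. *)
    let ids = List.init n (fun i -> i) in
    List.exists
      (fun perm ->
         let rename (e, t) = (e, List.nth perm t) in
         canon (List.map rename r1) = r2)
      (permutations ids)
```

The search tries up to n! renamings, so it is practical only when runs contain few distinct tags.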

(* poset with events of type 'e. *)
type 'e poset == ('e msatom) list * ('e msatom) order__order;;

(* history. use a map, not a function, so we can iterate over the
   non-empty edges in normalize_tags. *)
type 'e hist == (name, 'e poset) map__t;;

(* run with events of type 'e. Type (event run) corresponds to Run;
   type (event_FC run), to Run_{FC}. *)
type 'e run == (name, 'e hist) map__t;;

(* input-output function (no failures), with events of type 'e.
   Type (event iofn) corresponds to IOF. *)
type 'e iofn == ('e hist) -> ('e hist);;

(* input-output function with failures of type 'f.
   Type ((event, 'f) iofn_F) corresponds to IOF_F;
   type ((event_FC, 'f) iofn_F), to IOF_{FC}. *)
type ('e,'f) iofn_F == ('f list) * ('f -> ('e iofn));;

type failure = OK | ValFail | ... ;;

(* system (with failures) *)
type ('e,'f) system_F == (name, ('e,'f) iofn_F) map__t;;

(* failure scenario *)
type 'f failure_scenario == (name, 'f) map__t;;

(* fault-tolerance requirement *)
type ('e,'f) ft_req == ('f failure_scenario) -> ('e run) -> bool;;

(* step function for a given system_F in a given failure scenario *)
value step_F : ('e,'f) system_F -> ('f failure_scenario)
               -> 'e run -> 'e run;;

(* least fixed point *)
value lfp : (('e run) -> ('e run)) -> ('e run);;

(* return a list containing all failure scenarios for a system_F. *)
value failure_scenarios : ('e,'f) system_F
                          -> (('f failure_scenario) list);;

Figure B.6: CAML type definitions corresponding to Run, IOF, Run_FC, and IOF_FC, plus
miscellaneous other CAML type definitions and function declarations.

B.3 Using CRAFT

This  section  describes  in  more  detail  how  to  use  CRAFT  without  the  graphical  interface. 

To  use  CRAFT  for  a  non-perturbational  analysis: 

1.  Define  appropriate  types  aval  and  failure. 

2. Define input-output functions representing the components of the system. These functions
should be elements of (event, failure) iofn_F.

3. Express the fault-tolerance requirement as a function ftr of type (event, failure) ft_req.

4. Using those input-output functions, construct an element sys of type (event, failure)
system_F representing the system. Functions in the map module in the CAML standard
library are used to construct maps.

5. For each failure scenario fs of interest, call lfp (step_F sys fs) to compute a run
r of type event run representing the behavior of the system in that failure scenario,
and compute ftr fs r to check whether the fault-tolerance requirement is satisfied
in that failure scenario. If the fault-tolerance requirement is violated in some failure
scenarios, the corresponding runs can be inspected to help ascertain the problem.

To use CRAFT for a perturbational analysis, simply replace event with event_FC in the
above.
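
To make the workflow concrete, here is a toy walk-through of the final step in modern OCaml. It does not use CRAFT: the run representation (a list of component/message pairs), the two-component system, the requirement, and the functions step, lfp and ftr defined here are all invented stand-ins that only mirror the shape of the procedure.

```ocaml
(* Only the failure constructors visible in the source; the rest is elided there. *)
type failure = OK | ValFail

(* Toy stand-in for a run: a canonical list of (component, message) facts. *)
type run = (string * string) list
let canon (r : run) : run = List.sort_uniq compare r

(* Failure scenario as an association list (the source uses finite maps). *)
type scenario = (string * failure) list

(* Hypothetical system: "src" emits "m"; "relay" forwards what src emitted,
   unless relay is faulty, in which case it emits "garbage". *)
let step (fs : scenario) (r : run) : run =
  let src_out = [("src", "m")] in
  let relay_out =
    match List.assoc "relay" fs with
    | OK ->
      List.map (fun (_, m) -> ("relay", m))
        (List.filter (fun (c, _) -> c = "src") r)
    | ValFail -> [("relay", "garbage")]
  in
  canon (r @ src_out @ relay_out)

(* Iterate from the empty run until nothing changes. Plain = suffices here
   because this toy run has no tags. *)
let lfp (f : run -> run) : run =
  let rec go r = let r' = f r in if r' = r then r else go r' in
  go []

(* Toy requirement: relay outputs "m" and nothing else. *)
let ftr (_ : scenario) (r : run) : bool =
  List.mem ("relay", "m") r
  && List.for_all (fun (c, m) -> c <> "relay" || m = "m") r

let scenarios : scenario list =
  [ [("src", OK); ("relay", OK)]; [("src", OK); ("relay", ValFail)] ]

(* Step 5: one run and one verdict per failure scenario of interest. *)
let results =
  List.map (fun fs -> let r = lfp (step fs) in (fs, r, ftr fs r)) scenarios
```

In this toy system the requirement fails in the scenario where relay is faulty, and the run computed for that scenario shows the offending output.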